Usage Guidelines

This lesson is part of the DS Lab core curriculum. For that reason, this notebook can only be used on your WQU virtual machine.

This means:

  • ⓧ No downloading this notebook.
  • ⓧ No re-sharing of this notebook with friends or colleagues.
  • ⓧ No downloading the embedded videos in this notebook.
  • ⓧ No re-sharing embedded videos with friends or colleagues.
  • ⓧ No adding this notebook to public or private repositories.
  • ⓧ No uploading this notebook (or screenshots of it) to other websites, including websites for study resources.

<font size="+3"><strong>Databases: PyMongo</strong></font>

# Working with PyMongo

For all of these examples, we're going to be working with the "lagos" collection in the "air-quality" database. Before we can do anything else, we need to bring in pandas (which we won't use until the very end), pprint (a module that lets us see the data in an understandable way), and PyMongo (a library for working with MongoDB databases).

[1]:
from pprint import PrettyPrinter

import datetime  # used for the timestamp queries later in this lesson
import pandas as pd
from pymongo import MongoClient

pp = PrettyPrinter(indent=2)
## Databases


Data comes to us in lots of different ways, and one of those ways is in a database. A database is an organized collection of data, stored so that it can be searched and retrieved.

## Servers and Clients


Next, we need to connect to a server. A database server is where the database resides; it can be accessed using a client. Without a client, a database is just a collection of information that we can't work with, because we have no way in. We're going to be learning more about a database called MongoDB, and we'll use PrettyPrinter to make the information it generates easier to understand. Here's how the connection works:

[2]:
client = MongoClient(host="localhost", port=27017)
## Semi-structured Data


Databases are designed to work with either structured data or semi-structured data. In this part of the course, we're going to be working with databases that contain semi-structured data. Data is semi-structured when it has some kind of organizing logic, but that logic can't be displayed using rows and columns. Your email account contains semi-structured data if it’s divided into sections like Inbox, Sent, and Trash. If you’ve ever seen tweets from Twitter grouped by hashtag, that’s semi-structured data too. Semi-structured data is also used in sensor readings, which is what we'll be working with here.
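To make the idea concrete, here is a minimal plain-Python sketch (with hypothetical values modeled on the sensor readings we'll see below) contrasting a flat, tabular row with a semi-structured, nested record:

```python
import datetime

# A flat, tabular row: every value sits at the top level,
# so it fits naturally into rows and columns.
row = {"site": 4, "lat": 6.501, "lon": 3.367, "temperature": 27.0}

# The same reading as a semi-structured record: the organizing
# logic is nesting, which rows and columns can't display directly.
reading = {
    "metadata": {"site": 4, "lat": 6.501, "lon": 3.367},
    "temperature": 27.0,
    "timestamp": datetime.datetime(2018, 1, 7, 7, 7, 3),
}

# The nested form keeps the descriptive fields grouped under "metadata".
print(reading["metadata"]["site"])  # → 4
```

Both hold the same information, but the nested record can carry structure (like a whole `metadata` block) that a single table row can't express without flattening it.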

## Exploring a Database


So, now that we're connected to a server, let's take a look at what's there. Working our way down the specificity scale, the first thing we need to do is figure out which databases are on this server. To see which databases are on the server, we'll use the list_databases method, like this:

[3]:
pp.pprint(list(client.list_databases()))
[ {'empty': False, 'name': 'admin', 'sizeOnDisk': 40960},
  {'empty': False, 'name': 'air-quality', 'sizeOnDisk': 7000064},
  {'empty': False, 'name': 'config', 'sizeOnDisk': 12288},
  {'empty': False, 'name': 'local', 'sizeOnDisk': 73728},
  {'empty': False, 'name': 'wqu-abtest', 'sizeOnDisk': 585728}]
It looks like this server contains five databases: `"admin"`, `"air-quality"`, `"config"`, `"local"`, and `"wqu-abtest"`. We're only interested in `"air-quality"`, so let's connect to that one:

[4]:
db = client["air-quality"]
In MongoDB, a **database** is a container for **collections**. Each database gets its own set of files, and a single MongoDB **server** typically has multiple databases.


## Collections


Let's use a for loop to take a look at the collections in the "air-quality" database:

[5]:
for c in db.list_collections():
    print(c["name"])
system.views
lagos
system.buckets.lagos
nairobi
system.buckets.nairobi
dar-es-salaam
system.buckets.dar-es-salaam
As you can see, there are three actual collections here: `"nairobi"`, `"lagos"`, and `"dar-es-salaam"`. Since we're only interested in the `"lagos"` collection, let's get it on its own like this: 


[6]:
lagos = db["lagos"]
## Documents


A MongoDB **document** is an individual record of data in a **collection**, and is the basic unit of analysis in MongoDB. Documents come with **metadata** that helps us understand what the document is; we'll get back to that in a minute. In the meantime, let's use the [`count_documents`](https://pymongo.readthedocs.io/en/stable/api/pymongo/collection.html#pymongo.collection.Collection.count_documents) method to see how many documents the `"lagos"` collection contains:


[7]:
lagos.count_documents({})
[7]:
166496
<font size="+1">Practice</font>


Try it yourself! Bring in all the necessary libraries and modules, then connect to the "air-quality" database and print the number of documents in the "nairobi" collection.

[8]:
client = MongoClient(host="localhost", port=27017)
pp.pprint(list(client.list_databases()))
db = client["air-quality"]
for c in db.list_collections():
    print(c["name"])
nairobi = db["nairobi"]
nairobi.count_documents({})
[    {'empty': False, 'name': 'admin', 'sizeOnDisk': 40960},
     {'empty': False, 'name': 'air-quality', 'sizeOnDisk': 7000064},
     {'empty': False, 'name': 'config', 'sizeOnDisk': 12288},
     {'empty': False, 'name': 'local', 'sizeOnDisk': 73728},
     {'empty': False, 'name': 'wqu-abtest', 'sizeOnDisk': 585728}]
system.views
lagos
system.buckets.lagos
nairobi
system.buckets.nairobi
dar-es-salaam
system.buckets.dar-es-salaam
[8]:
202212
### Retrieving Data


Now that we know how many documents the "lagos" collection contains, let's take a closer look at what's there. The first thing you'll notice is that the output starts out with a curly bracket ({), and ends with a curly bracket (}). That tells us that this information is a dictionary. To access documents in the collection, we'll use two methods: find and find_one. As you might expect, find will retrieve all the documents, and find_one will bring back only the first document. For now, let's stick to find_one; we'll come back to find later.

Just like everywhere else, we'll need to assign a variable name to whatever comes back, so let's call this one result.

[9]:
result = lagos.find_one({})
pp.pprint(result)
{    '_id': ObjectId('6334b0f18c51459f9b1d955d'),
     'metadata': {    'lat': 6.501,
                      'lon': 3.367,
                      'measurement': 'temperature',
                      'sensor_id': 10,
                      'sensor_type': 'DHT11',
                      'site': 4},
     'temperature': nan,
     'timestamp': datetime.datetime(2018, 1, 7, 7, 7, 3, 88000)}
### Key-Value Pairs


There's a lot going on here! Let's work from the bottom up, starting with this:

{
    'temperature': nan,
    'timestamp': datetime.datetime(2018, 1, 7, 7, 7, 3, 88000)
}

The actual data is labeled temperature and timestamp, and if seeing it presented this way seems familiar, that's because what you're seeing at the bottom are two key-value pairs. In PyMongo, "_id" is always the primary key. Primary keys are the field(s) whose values uniquely identify each record in a collection; we'll talk about that more in a minute.
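Because a retrieved document behaves like a Python dictionary, each key-value pair can be read back with ordinary dictionary indexing. A small sketch, using hypothetical values shaped like the pairs above:

```python
import datetime

# Hypothetical document fragment shaped like the output above.
doc = {
    "temperature": 27.0,
    "timestamp": datetime.datetime(2018, 1, 7, 7, 7, 3, 88000),
}

# Indexing by key returns the paired value.
temp = doc["temperature"]
when = doc["timestamp"].date()
print(temp, when)  # → 27.0 2018-01-07
```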

### Metadata


Next, we have this:

'metadata': { 'lat': 6.501,
              'lon': 3.367,
              'measurement': 'temperature',
              'sensor_id': 10,
              'sensor_type': 'DHT11',
              'site': 4}

This is the document's metadata. Metadata is data about the data. If you’re working with a database, its data is the information it contains, and its metadata describes what that information is. In MongoDB, each document often has metadata of its own. If we go back to the example of your email account, each message in your Sent folder includes both the message itself and information about when you sent it and who you sent it to; the message is data, and the other information is metadata.

The metadata we see in this block of code tells us what the key-value pairs from the last code block mean, and where the information stored there comes from. There's location data, a line telling us what the key-value pairs measure, some information about the equipment used to gather the data, and the site the data came from.
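One practical consequence: when we filter on nested metadata later, MongoDB addresses it with dotted paths like "metadata.site". In plain Python, that corresponds to walking the nested dictionaries one key at a time; here's a sketch with a hypothetical `get_path` helper (an illustration, not part of PyMongo):

```python
def get_path(document, dotted_path):
    """Walk a nested dict following a MongoDB-style dotted path."""
    value = document
    for key in dotted_path.split("."):
        value = value[key]
    return value

# Hypothetical document shaped like the metadata shown above.
doc = {
    "metadata": {
        "lat": 6.501,
        "lon": 3.367,
        "measurement": "temperature",
        "sensor_id": 10,
        "sensor_type": "DHT11",
        "site": 4,
    }
}

print(get_path(doc, "metadata.site"))         # → 4
print(get_path(doc, "metadata.sensor_type"))  # → DHT11
```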

### Identifiers


Finally, at the top, we have this:

{ 
    '_id': ObjectId('6334b0f18c51459f9b1d955d')
}

This is the document's unique identifier, which is similar to the index label for each row in a pandas DataFrame.
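The hexadecimal string inside an ObjectId isn't arbitrary: in MongoDB's ObjectId format, the first 4 bytes (8 hex characters) encode the creation time as a Unix timestamp, and the remaining bytes hold a random value and a counter. As a sketch, we can decode that prefix with the standard library alone, using one of the `_id` values from this dataset:

```python
import datetime

object_id = "6334b0e98c51459f9b198d27"  # an _id from the "nairobi" collection

# The first 8 hex characters are seconds since the Unix epoch (UTC).
seconds = int(object_id[:8], 16)
created = datetime.datetime.fromtimestamp(seconds, tz=datetime.timezone.utc)
print(created)  # when MongoDB generated this _id
```

This is why documents inserted around the same time have ObjectIds that share a common prefix, as in the outputs below.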

<font size="+1">Practice</font>


Try it yourself! Retrieve a single document from the "nairobi" collection, and print the result.

[10]:
result = nairobi.find_one({})
pp.pprint(result)
{    'P1': 39.67,
     '_id': ObjectId('6334b0e98c51459f9b198d27'),
     'metadata': {    'lat': -1.3,
                      'lon': 36.785,
                      'measurement': 'P1',
                      'sensor_id': 57,
                      'sensor_type': 'SDS011',
                      'site': 29},
     'timestamp': datetime.datetime(2018, 9, 1, 0, 0, 2, 472000)}
## Analyzing Data


Now that we've seen what a document looks like in this collection, let's start working with what we've got. Since our metadata includes information about each sensor's "site", we might be curious to know how many sites are in the "lagos" collection. To do that, we'll use the distinct method, like this:

[11]:
lagos.distinct("metadata.site")
[11]:
[3, 4]
Notice that in order to grab the `"site"` number, we needed to include the `"metadata"` tag. 


This tells us that there are 2 sensor sites in Lagos: one labeled 3 and the other labeled 4.

Let's go further. We know that there are two sensor sites in Lagos, but we don't know how many documents are associated with each site. To find that out, we'll use the count_documents method for each site.

[12]:
print("Documents from site 3:", lagos.count_documents({"metadata.site": 3}))
print("Documents from site 4:", lagos.count_documents({"metadata.site": 4}))
Documents from site 3: 140586
Documents from site 4: 25910
<font size="+1">Practice</font>


Try it yourself! Find out how many sensor sites are in Nairobi, what their labels are, and how many documents are associated with each one.

[13]:
print("Documents from site 29:", nairobi.count_documents({"metadata.site": 29}))
print("Documents from site 6:", nairobi.count_documents({"metadata.site": 6}))
Documents from site 29: 131852
Documents from site 6: 70360
Now that we know how many *documents* are associated with each site, let's keep drilling down and find the number of *readings* for each site. We'll do this with the [`aggregate`](https://pymongo.readthedocs.io/en/stable/api/pymongo/collection.html#pymongo.collection.Collection.aggregate) method.


Before we run it, let's take a look at what's happening in the code. First, you'll notice several dollar signs ($). In an aggregation pipeline, a dollar sign marks either an operator (like $group and $count) or a reference to a field in the documents (like "$metadata.site"). Here, we're telling the collection to group the documents by their metadata.site value, and then to count how many documents fall into each group.

There's also a new field: "_id". In PyMongo, "_id" is always the primary key — the field whose values uniquely identify each record. Inside a $group stage, "_id" holds the value that defines each group.

Let's run the code and see what happens:

[15]:
result = lagos.aggregate(
    [{"$group": {"_id": "$metadata.site", "count": {"$count": {}}}}]
)
pp.pprint(list(result))
[{'_id': 3, 'count': 140586}, {'_id': 4, 'count': 25910}]
With that information in mind, we might want to know what those readings actually are. Since we're really interested in measures of air quality, let's take a look at the `P2` values in the `"lagos"` collection. `P2` measures the amount of particulate matter in the air, which in this case is something called PM 2.5. If we wanted to get all the documents in a collection, we could, but that would result in an unmanageably large number of records clogging up the memory on our machines. Instead, let's use the [`find`](https://pymongo.readthedocs.io/en/stable/api/pymongo/collection.html#pymongo.collection.Collection.find) method and use `limit` to make sure we only get back the first 3. 


[16]:
result = lagos.find({"metadata.measurement": "P2"}).limit(3)
pp.pprint(list(result))
[    {    'P2': 14.42,
          '_id': ObjectId('6334b0f28c51459f9b1de145'),
          'metadata': {    'lat': 6.501,
                           'lon': 3.367,
                           'measurement': 'P2',
                           'sensor_id': 6,
                           'sensor_type': 'PPD42NS',
                           'site': 4},
          'timestamp': datetime.datetime(2018, 1, 7, 7, 7, 3, 39000)},
     {    'P2': 19.66,
          '_id': ObjectId('6334b0f28c51459f9b1de146'),
          'metadata': {    'lat': 6.501,
                           'lon': 3.367,
                           'measurement': 'P2',
                           'sensor_id': 6,
                           'sensor_type': 'PPD42NS',
                           'site': 4},
          'timestamp': datetime.datetime(2018, 1, 7, 7, 11, 23, 870000)},
     {    'P2': 24.79,
          '_id': ObjectId('6334b0f28c51459f9b1de147'),
          'metadata': {    'lat': 6.501,
                           'lon': 3.367,
                           'measurement': 'P2',
                           'sensor_id': 6,
                           'sensor_type': 'PPD42NS',
                           'site': 4},
          'timestamp': datetime.datetime(2018, 1, 7, 7, 21, 53, 981000)}]
<font size="+1">Practice</font>


Try it yourself! Find out how many sensor sites are in Nairobi, what their labels are, how many documents are associated with each one, and the number of observations from each site. Then, return the first three documents with the value P2.

[17]:
result = nairobi.aggregate(
    [{"$group": {"_id": "$metadata.site", "count": {"$count": {}}}}]
)
pp.pprint(list(result))
result = nairobi.find({"metadata.measurement": "P2"}).limit(3)
pp.pprint(list(result))
[{'_id': 6, 'count': 70360}, {'_id': 29, 'count': 131852}]
So far, we've been dealing with relatively small subsets of the data in our collections, but what if we need to work with something bigger? Let's start by using `distinct` to remind ourselves of the kinds of data we have at our disposal.


[18]:
lagos.distinct("metadata.measurement")
[18]:
['humidity', 'temperature', 'P1', 'P2']
There are also comparison query operators that can be helpful for filtering the data. The most common are:

  • $gt: greater than (>)
  • $lt: less than (<)
  • $gte: greater than or equal to (>=)
  • $lte: less than or equal to (<=)
  • $eq: equal to (==)
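Under the hood, these operators express ordinary comparisons. As a rough plain-Python sketch (hypothetical readings; the `OPS` table and `matches` helper are illustrations, not PyMongo API), this is approximately the filtering that a query like `{"P2": {"$gt": 20}}` performs:

```python
import operator

# Map MongoDB comparison operators onto Python comparison functions.
OPS = {
    "$gt": operator.gt,
    "$lt": operator.lt,
    "$gte": operator.ge,
    "$lte": operator.le,
    "$eq": operator.eq,
}

def matches(document, field, op, value):
    """Return True if document[field] satisfies the comparison."""
    return OPS[op](document[field], value)

readings = [{"P2": 14.42}, {"P2": 19.66}, {"P2": 24.79}]

# Roughly what find({"P2": {"$gt": 20}}) selects:
selected = [r for r in readings if matches(r, "P2", "$gt", 20)]
print(selected)  # → [{'P2': 24.79}]
```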

Let's use the timestamp to see how we can use these operators to select different documents:

[19]:
result = nairobi.find({"timestamp": {"$gt": datetime.datetime(2018, 9, 1)}}).limit(3)
pp.pprint(list(result))
[    {    'P1': 39.67,
          '_id': ObjectId('6334b0e98c51459f9b198d27'),
          'metadata': {    'lat': -1.3,
                           'lon': 36.785,
                           'measurement': 'P1',
                           'sensor_id': 57,
                           'sensor_type': 'SDS011',
                           'site': 29},
          'timestamp': datetime.datetime(2018, 9, 1, 0, 0, 2, 472000)},
     {    'P1': 39.13,
          '_id': ObjectId('6334b0e98c51459f9b198d28'),
          'metadata': {    'lat': -1.3,
                           'lon': 36.785,
                           'measurement': 'P1',
                           'sensor_id': 57,
                           'sensor_type': 'SDS011',
                           'site': 29},
          'timestamp': datetime.datetime(2018, 9, 1, 0, 5, 3, 941000)},
     {    'P1': 30.07,
          '_id': ObjectId('6334b0e98c51459f9b198d29'),
          'metadata': {    'lat': -1.3,
                           'lon': 36.785,
                           'measurement': 'P1',
                           'sensor_id': 57,
                           'sensor_type': 'SDS011',
                           'site': 29},
          'timestamp': datetime.datetime(2018, 9, 1, 0, 10, 4, 374000)}]
[20]:
result = nairobi.find({"timestamp": {"$lt": datetime.datetime(2018, 12, 1)}}).limit(3)
pp.pprint(list(result))
[    {    'P1': 39.67,
          '_id': ObjectId('6334b0e98c51459f9b198d27'),
          'metadata': {    'lat': -1.3,
                           'lon': 36.785,
                           'measurement': 'P1',
                           'sensor_id': 57,
                           'sensor_type': 'SDS011',
                           'site': 29},
          'timestamp': datetime.datetime(2018, 9, 1, 0, 0, 2, 472000)},
     {    'P1': 39.13,
          '_id': ObjectId('6334b0e98c51459f9b198d28'),
          'metadata': {    'lat': -1.3,
                           'lon': 36.785,
                           'measurement': 'P1',
                           'sensor_id': 57,
                           'sensor_type': 'SDS011',
                           'site': 29},
          'timestamp': datetime.datetime(2018, 9, 1, 0, 5, 3, 941000)},
     {    'P1': 30.07,
          '_id': ObjectId('6334b0e98c51459f9b198d29'),
          'metadata': {    'lat': -1.3,
                           'lon': 36.785,
                           'measurement': 'P1',
                           'sensor_id': 57,
                           'sensor_type': 'SDS011',
                           'site': 29},
          'timestamp': datetime.datetime(2018, 9, 1, 0, 10, 4, 374000)}]
[21]:
result = nairobi.find(
    {"timestamp": {"$eq": datetime.datetime(2018, 9, 1, 0, 0, 2, 472000)}}
).limit(3)
pp.pprint(list(result))
[    {    'P1': 39.67,
          '_id': ObjectId('6334b0e98c51459f9b198d27'),
          'metadata': {    'lat': -1.3,
                           'lon': 36.785,
                           'measurement': 'P1',
                           'sensor_id': 57,
                           'sensor_type': 'SDS011',
                           'site': 29},
          'timestamp': datetime.datetime(2018, 9, 1, 0, 0, 2, 472000)},
     {    'P2': 34.43,
          '_id': ObjectId('6334b0ea8c51459f9b1a0db2'),
          'metadata': {    'lat': -1.3,
                           'lon': 36.785,
                           'measurement': 'P2',
                           'sensor_id': 57,
                           'sensor_type': 'SDS011',
                           'site': 29},
          'timestamp': datetime.datetime(2018, 9, 1, 0, 0, 2, 472000)}]
<font size="+1">Practice</font>


Try it yourself! Find three documents with timestamp greater than or equal to, and then less than or equal to, the date December 1, 2018 (datetime.datetime(2018, 12, 1, 0, 0, 6, 767000)).

[22]:
result = nairobi.find(
    {"timestamp": {"$gte": datetime.datetime(2018, 12, 1, 0, 0, 6, 767000)}}
).limit(3)
pp.pprint(list(result))
[    {    'P1': 17.08,
          '_id': ObjectId('6334b0e98c51459f9b19eba8'),
          'metadata': {    'lat': -1.3,
                           'lon': 36.785,
                           'measurement': 'P1',
                           'sensor_id': 57,
                           'sensor_type': 'SDS011',
                           'site': 29},
          'timestamp': datetime.datetime(2018, 12, 1, 0, 0, 6, 767000)},
     {    'P1': 17.62,
          '_id': ObjectId('6334b0e98c51459f9b19eba9'),
          'metadata': {    'lat': -1.3,
                           'lon': 36.785,
                           'measurement': 'P1',
                           'sensor_id': 57,
                           'sensor_type': 'SDS011',
                           'site': 29},
          'timestamp': datetime.datetime(2018, 12, 1, 0, 5, 6, 327000)},
     {    'P1': 11.05,
          '_id': ObjectId('6334b0e98c51459f9b19ebaa'),
          'metadata': {    'lat': -1.3,
                           'lon': 36.785,
                           'measurement': 'P1',
                           'sensor_id': 57,
                           'sensor_type': 'SDS011',
                           'site': 29},
          'timestamp': datetime.datetime(2018, 12, 1, 0, 10, 5, 579000)}]
[23]:
# Less than or equal to
result = ...

pp.pprint(list(result))
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In [23], line 5
      1 # Less than or equal to
      3 result = ...
----> 5 pp.pprint(list(result))

TypeError: 'ellipsis' object is not iterable
## Updating Documents


We can also update documents by passing a filter and new values, using `update_one` to update one record or `update_many` to update many records. Let's look at an example. Before updating, the record looks like this:

[ ]:
result = nairobi.find(
    {"timestamp": {"$eq": datetime.datetime(2018, 9, 1, 0, 0, 2, 472000)}}
).limit(3)
pp.pprint(list(result))
Now let's update the sensor type from `"SDS011"` to `"SDS"`. We first select all records with sensor type equal to `"SDS011"`, then set the new value to `"SDS"`:

[ ]:
result = nairobi.update_many(
    {"metadata.sensor_type": {"$eq": "SDS011"}},
    {"$set": {"metadata.sensor_type": "SDS"}},
)
Now we can see all records have changed:


[ ]:
result = nairobi.find(
    {"timestamp": {"$eq": datetime.datetime(2018, 9, 1, 0, 0, 2, 472000)}}
).limit(3)
pp.pprint(list(result))
We can change it back:


[ ]:
result = nairobi.update_many(
    {"metadata.sensor_type": {"$eq": "SDS"}},
    {"$set": {"metadata.sensor_type": "SDS011"}},
)
[ ]:
result.raw_result
## Aggregation


Since we're looking for *big* numbers, we need to figure out which one of those dimensions has the largest number of measurements by **aggregating** the data in each document. Since we already know that `site 3` has significantly more documents than `site 4`, let's start by looking at `site 3`. We can use the `$match` syntax to select only `site 3` data. The code to do that looks like this:

[ ]:
result = lagos.aggregate(
    [
        {"$match": {"metadata.site": 3}},
        {"$group": {"_id": "$metadata.measurement", "count": {"$count": {}}}},
    ]
)
pp.pprint(list(result))
<font size="+1">Practice</font>


Try it yourself! Find the number of each measurement type at site 29 in Nairobi.

[ ]:
pp.pprint(list(result))
After aggregation, there is another useful operator called `$project`, which lets you specify which fields to display, adding new fields or deleting existing ones. Using the Nairobi data from site 29, we can first count each sensor type:

[ ]:
result = nairobi.aggregate(
    [
        {"$match": {"metadata.site": 29}},
        {"$group": {"_id": "$metadata.sensor_type", "count": {"$count": {}}}},
    ]
)
pp.pprint(list(result))
We can see the two sensor types and their corresponding counts. If we only want to display the types, but don't care about the counts, we can suppress the `count` field by setting it to 0 in `$project`:

[ ]:
result = nairobi.aggregate(
    [
        {"$match": {"metadata.site": 29}},
        {"$group": {"_id": "$metadata.sensor_type", "count": {"$count": {}}}},
        {"$project": {"count": 0}},
    ]
)
pp.pprint(list(result))
The `$project` syntax is also useful for deleting the intermediate fields that we used to generate our final fields but no longer need. In the following example, let's calculate the date difference for each sensor type. We'll first use the aggregation method to get the start date and last date. 


[ ]:
result = nairobi.aggregate(
    [
        {"$match": {"metadata.site": 29}},
        {
            "$group": {
                "_id": "$metadata.sensor_type",
                "date_min": {"$min": "$timestamp"},
                "date_max": {"$max": "$timestamp"},
            }
        },
    ]
)
pp.pprint(list(result))
Then we can calculate the date difference using `$dateDiff`, which takes a start date, an end date, and a unit for the timestamp data. We can see from the results above that the dates are very close to each other; the only differences are in the minutes, so we can specify the unit as `minute` to show the difference. Since we don't need the start and end dates themselves, we can define a `"dateDiff"` field inside `$project` so that only it is shown in the final display:
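For intuition, subtracting two Python datetimes yields a timedelta that we can express in minutes ourselves. This is only an approximation of `$dateDiff`, which counts unit boundaries crossed rather than elapsed time; the values here are hypothetical:

```python
import datetime

date_min = datetime.datetime(2018, 9, 1, 0, 0, 2)
date_max = datetime.datetime(2018, 9, 1, 0, 47, 2)

# Elapsed time between the two timestamps, expressed in minutes.
delta = date_max - date_min
minutes = int(delta.total_seconds() // 60)
print(minutes)  # → 47
```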

[ ]:
result = nairobi.aggregate(
    [
        {"$match": {"metadata.site": 29}},
        {
            "$group": {
                "_id": "$metadata.sensor_type",
                "date_min": {"$min": "$timestamp"},
                "date_max": {"$max": "$timestamp"},
            }
        },
        {
            "$project": {
                "dateDiff": {
                    "$dateDiff": {
                        "startDate": "$date_min",
                        "endDate": "$date_max",
                        "unit": "minute",
                    }
                }
            }
        },
    ]
)
pp.pprint(list(result))
If we specify unit as `day`, it will show the difference between the dates:


[ ]:
result = nairobi.aggregate(
    [
        {"$match": {"metadata.site": 29}},
        {
            "$group": {
                "_id": "$metadata.sensor_type",
                "date_min": {"$min": "$timestamp"},
                "date_max": {"$max": "$timestamp"},
            }
        },
        {
            "$project": {
                "dateDiff": {
                    "$dateDiff": {
                        "startDate": "$date_min",
                        "endDate": "$date_max",
                        "unit": "day",
                    }
                }
            }
        },
    ]
)
pp.pprint(list(result))
<font size="+1">Practice</font>


Try it yourself! Find the date difference for each measurement type at site 29 in Nairobi.

[ ]:
pp.pprint(list(result))
We can do more with the date data using `$dateTrunc`, which truncates datetime data. We need to specify the datetime data, which can be a `Date`, a `Timestamp`, or an `ObjectID`. Then we need to specify the `unit` (year, month, day, hour, minute, second) and `binSize` (numerical variable defining the size of the truncation). Let's check the example below, where we group data by the month using `$dateTrunc` and then count how many observations there are for each month.

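Conceptually, truncating a timestamp to the month means zeroing out every component finer than the month. In plain Python that's a single `replace` call; here's a sketch of what `$dateTrunc` with unit `"month"` does to one hypothetical timestamp:

```python
import datetime

ts = datetime.datetime(2018, 9, 14, 13, 18, 10)

# Keep the year and month; reset the day and time to the start of the month.
truncated = ts.replace(day=1, hour=0, minute=0, second=0, microsecond=0)
print(truncated)  # → 2018-09-01 00:00:00
```

Grouping on the truncated value is what lets every reading from the same month land in the same bucket.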

[ ]:
                            "date": "$timestamp",
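As a hedged sketch (the field names and the `collection` handle are assumptions, not the notebook's exact cell), the `$dateTrunc` grouping can be written as plain Python dictionaries:

```python
# Sketch only: truncate each timestamp to the start of its month, use the
# truncated date as the group key, and count the observations per group.
pipeline = [
    {
        "$group": {
            "_id": {
                "$dateTrunc": {
                    "date": "$timestamp",
                    "unit": "month",
                    "binSize": 1,
                }
            },
            "count": {"$count": {}},
        }
    }
]

# With a live connection, this would run as:
# result = collection.aggregate(pipeline)
```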
<font size="+1">Practice</font>

Try it yourself! Truncate the dates by week and count the observations at site 29 in Nairobi.

[ ]:
pp.pprint(list(result))
## Finishing Up

So far, we've connected to a server, accessed that server with a client, found the collection we were looking for within a database, and explored that collection in all sorts of different ways. Now it's time to get the data we'll actually need to build a model, and store that in a way we'll be able to use.

Let's use `find` to retrieve the PM 2.5 data from site 3. And, since we don't need any of the metadata to build our model, let's strip that out using the `projection` argument. In this case, we're telling the collection that we only want to see `"timestamp"` and `"P2"`. Keep in mind that we limited the number of records we'll get back to 3 when we defined `result` above.

[ ]:
    # `projection` limits the kinds of data we'll get back.
Finally, we'll use pandas to read the extracted data into a DataFrame, making sure to set `timestamp` as the index:

[ ]:
df = pd.DataFrame(result).set_index("timestamp")
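As a self-contained sketch of that last step (the documents below are made-up stand-ins for the query results, not real readings):

```python
import pandas as pd

# Documents shaped like the projected query results: only a timestamp
# and a P2 reading per record. The values are hypothetical.
docs = [
    {"timestamp": "2018-01-01 00:00:05", "P2": 8.79},
    {"timestamp": "2018-01-01 00:05:05", "P2": 9.12},
]

# Read the records into a DataFrame, with "timestamp" as the index.
df = pd.DataFrame(docs).set_index("timestamp")
```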
<font size="+1">Practice</font>

Try it yourself! Retrieve the PM 2.5 data from site 29 in Nairobi and strip out the metadata to create a DataFrame that shows only timestamp and P2. Print the result.

[ ]:
result = ...
# References & Further Reading

  • Further reading about servers and clients
  • Definitions from the MongoDB documentation
  • Information on Iterators
  • MongoDB documentation on Aggregation
---

Copyright 2022 WorldQuant University. This content is licensed solely for personal use. Redistribution or publication of this material is strictly prohibited.


Usage Guidelines

This lesson is part of the DS Lab core curriculum. For that reason, this notebook can only be used on your WQU virtual machine.

This means:

  • ⓧ No downloading this notebook.
  • ⓧ No re-sharing of this notebook with friends or colleagues.
  • ⓧ No downloading the embedded videos in this notebook.
  • ⓧ No re-sharing embedded videos with friends or colleagues.
  • ⓧ No adding this notebook to public or private repositories.
  • ⓧ No uploading this notebook (or screenshots of it) to other websites, including websites for study resources.

<font size="+3"><strong>Databases: SQL</strong></font>

[ ]:
from IPython.display import YouTubeVideo
# Working with SQL Databases

A database is a collection of interrelated data. The primary goal of a database is to store and retrieve information in a convenient and efficient way. There are many types of databases. In this section, we will be dealing with a **relational database**. A relational database is a widely used database model that consists of a collection of uniquely named **tables** used to store information. The structure of a database model with its tables, constraints, and relationships is called a **schema**. 

Structured Query Language (SQL) is used to retrieve information from a relational database. SQL is one of the most commonly used database languages. It allows data stored in a relational database to be queried, modified, and manipulated easily with basic commands. SQL powers database engines like MySQL, SQL Server, SQLite, and PostgreSQL. The examples and projects in this course will use SQLite.

A table refers to a collection of rows and columns in a relational database. When reading data into a pandas DataFrame, an index can be defined, which acts as the label for every row in the DataFrame.

# Connecting to a Database

## ipython-sql 

### Magic Commands

Jupyter notebooks can run code that is not valid Python code but that still affects the notebook. These special commands are called magic commands. Magic commands can have a range of properties. Some commonly used magic functions are below:

| Magic Command | Description of Command |
|---|---|
| `%pwd` | Print the current working directory |
| `%cd` | Change the current working directory |
| `%ls` | List the contents of the current directory |
| `%history` | Show the history of the `In [ ]:` commands |

We will be leveraging magic commands to work with a SQLite database.

### ipython-sql

`ipython-sql` allows you to write SQL code directly in a Jupyter Notebook. The `%sql` (or `%%sql`) magic command is added to the beginning of a code block and then SQL code can be written.

### Connecting with ipython-sql

We can connect to a database using the `%sql` magic function:

[ ]:
%sql sqlite:////home/jovyan/nepal.sqlite
## sqlite3

We can also connect to the same database using the `sqlite3` package:

[ ]:
conn = sqlite3.connect("/home/jovyan/nepal.sqlite")
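A minimal, runnable sketch of the same workflow, using an in-memory database with made-up rows standing in for `nepal.sqlite`:

```python
import sqlite3

# A throwaway in-memory database standing in for nepal.sqlite.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE id_map (household_id INTEGER, district_id INTEGER)")
conn.executemany(
    "INSERT INTO id_map VALUES (?, ?)",
    [(101, 1), (102, 2), (103, 2)],
)

# Run a query through the connection, just as with the real database.
rows = conn.execute("SELECT * FROM id_map LIMIT 2").fetchall()
```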
# Querying a Database

## Building Blocks of the Basic Query

There are six common clauses used for querying data:

| Clause Name | Definition |
|---|---|
| `SELECT` | Determines which columns to include in the query's result |
| `FROM` | Identifies the table from which to query the data |
| `WHERE` | Filters out rows that don't meet the given conditions |
| `GROUP BY` | Groups rows by common values in columns |
| `HAVING` | Filters out unwanted groups from `GROUP BY` |
| `ORDER BY` | Orders the rows using one or more columns |
| `LIMIT` | Outputs the specified number of rows |

All clauses may be used together, but SELECT and FROM are the only required clauses. The format of clauses is in the example query below:

SELECT column1, column2
FROM table_name
WHERE "conditions"
GROUP BY "column-list"
HAVING "conditions"
ORDER BY "column-list"
LIMIT "number-of-rows"
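To make the format concrete, here is a runnable sketch that uses all the clauses against a small in-memory table (the table and rows are made up for illustration, not taken from `nepal.sqlite`):

```python
import sqlite3

# A tiny made-up table of homes per district.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE homes (district_id INTEGER, floors INTEGER)")
conn.executemany(
    "INSERT INTO homes VALUES (?, ?)",
    [(1, 2), (1, 3), (2, 1), (2, 2), (3, 1)],
)

# All six clauses in one query: count homes per district, keep every
# group, sort by the count, and return at most two rows.
query = """
    SELECT district_id, COUNT(*) AS n_homes
    FROM homes
    WHERE floors >= 1
    GROUP BY district_id
    HAVING COUNT(*) >= 1
    ORDER BY n_homes DESC
    LIMIT 2
"""
rows = conn.execute(query).fetchall()
```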
## SELECT and FROM

You can use `SELECT *` to select all columns in a table. `FROM` specifies the table in the database to query. `LIMIT 5` will select only the first five rows. 

Example

[ ]:
FROM id_map
You can also use `SELECT` to select certain columns in a table.

[ ]:
SELECT household_id,
<font size="+1">Practice</font>

Try it yourself! Use SELECT to select the district_id column from the id_map table.

[ ]:
%%sql
We can also assign an **alias** or temporary name to a column using the `AS` command. Aliases can also be used on a table. See the example below, which assigns the alias `household_number` to `household_id`.

[ ]:
SELECT household_id AS household_number,
<font size="+1">Practice</font>

Try it yourself! Use SELECT, FROM, AS, and LIMIT to select the first 5 rows from the id_map table. Rename the district_id column to district_number.

[ ]:
%%sql
## Filtering and Sorting Data

SQL provides a variety of comparison operators that can be used with the WHERE clause to filter the data. 

| Comparison Operator | Description |
|---|---|
| `=` | Equal |
| `>` | Greater than |
| `<` | Less than |
| `>=` | Greater than or equal to |
| `<=` | Less than or equal to |
| `<>` or `!=` | Not equal to |
| `LIKE` | String comparison test |

For example, to select the first 5 homes in Ramechhap (district `2`):

[ ]:
%%sql
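The comparison operators can also be sketched in runnable form with `sqlite3` and made-up data (the notebook's actual cell runs `%%sql` against `nepal.sqlite`):

```python
import sqlite3

# A made-up table with a text column to filter on.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE id_map (household_id INTEGER, district TEXT)")
conn.executemany(
    "INSERT INTO id_map VALUES (?, ?)",
    [(1, "Ramechhap"), (2, "Sindhupalchok"), (3, "Ramechhap")],
)

# <> keeps only rows whose district is NOT Ramechhap.
not_equal = conn.execute(
    "SELECT household_id FROM id_map WHERE district <> 'Ramechhap'"
).fetchall()

# LIKE matches strings against a pattern; % matches any suffix.
like_match = conn.execute(
    "SELECT household_id FROM id_map WHERE district LIKE 'Rame%'"
).fetchall()
```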
<font size="+1">Practice</font>

Try it yourself! Use WHERE to select the row with household_id equal to 13735001.

[ ]:
%%sql
## Aggregating Data

Aggregation functions take a collection of values as inputs and return one value as the output. The table below gives the frequently used built-in aggregation functions:

| Aggregation Function | Definition |
|---|---|
| `MIN` | Return the minimum value |
| `MAX` | Return the maximum value |
| `SUM` | Return the sum of values |
| `AVG` | Return the average of values |
| `COUNT` | Return the number of observations |

Use the `COUNT` function to find the number of observations in the `id_map` table that come from Ramechhap (district `2`):

[ ]:
WHERE district_id = 2
Aggregation functions are frequently used with a `GROUP BY` clause to perform the aggregation on groups of data. For example, the query below returns the count of observations in each district:

[ ]:
GROUP BY district_id
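A runnable sketch of grouped aggregation on a made-up in-memory table (the real notebook cell runs against `nepal.sqlite`):

```python
import sqlite3

# Three households across two districts, standing in for the real id_map.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE id_map (household_id INTEGER, district_id INTEGER)")
conn.executemany("INSERT INTO id_map VALUES (?, ?)", [(1, 1), (2, 1), (3, 2)])

# COUNT(*) is computed once per district thanks to GROUP BY.
per_district = conn.execute(
    "SELECT district_id, COUNT(*) FROM id_map GROUP BY district_id"
).fetchall()
```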
`DISTINCT` is a keyword to select unique records in a query result. For example, if we want to know the unique values in the `district_id` column:

[ ]:
SELECT distinct(district_id)
<font size="+1">Practice</font>

Try it yourself! Use DISTINCT to count the number of unique values in the vdcmun_id column.

[ ]:
%%sql
`DISTINCT` and `COUNT` can be used in combination to count the number of distinct records. For example, if we want to know the number of unique values in the `district_id` column:

[ ]:
SELECT count(distinct(district_id))
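A runnable sketch of the same combination on made-up data:

```python
import sqlite3

# Four rows but only three distinct district values.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE id_map (district_id INTEGER)")
conn.executemany("INSERT INTO id_map VALUES (?)", [(1,), (1,), (2,), (3,)])

# COUNT(DISTINCT ...) counts each unique value once.
n_unique = conn.execute(
    "SELECT COUNT(DISTINCT district_id) FROM id_map"
).fetchone()[0]
```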
<font size="+1">Practice</font>

Try it yourself! Use DISTINCT and COUNT to count the number of unique values in the vdcmun_id column.

[ ]:
%%sql
# Joining Tables

Joins link data from two or more tables together by using a column that is common between the two tables. The basic syntax for a join is below, where `table1` and `table2` refer to the two tables being joined, `column1` and `column2` refer to columns to be returned from both tables, and `ID` refers to the common column in the two tables. 

SELECT table1.column1,
       table2.column2
FROM table1
JOIN table2 ON table1.ID = table2.ID
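Here is a runnable sketch of that syntax on two tiny made-up tables (the values are hypothetical, not from the Nepal database):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE id_map (building_id INTEGER, district_id INTEGER)")
conn.execute(
    "CREATE TABLE building_structure (building_id INTEGER, foundation_type TEXT)"
)
conn.executemany("INSERT INTO id_map VALUES (?, ?)", [(23, 2), (99, 2)])
conn.execute("INSERT INTO building_structure VALUES (23, 'Stone')")

# An inner join keeps only building 23, which appears in both tables.
inner = conn.execute(
    """
    SELECT id_map.building_id, building_structure.foundation_type
    FROM id_map
    JOIN building_structure
      ON id_map.building_id = building_structure.building_id
    """
).fetchall()

# A left join also keeps building 99, padding the missing value with NULL.
left = conn.execute(
    """
    SELECT id_map.building_id, building_structure.foundation_type
    FROM id_map
    LEFT JOIN building_structure
      ON id_map.building_id = building_structure.building_id
    """
).fetchall()
```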
We'll explore the concept of joins by first identifying a single household that we'd like to pull in building information for. For example, let's say we want to see the corresponding `foundation_type` for the first home in Ramechhap (district `2`). We'll start by looking at this single record in the `id_map` table.

[ ]:
WHERE district_id = 2
This household has `building_id` equal to 23. Let's look at the `foundation_type` for this building, by filtering the `building_structure` table to find this building.

[ ]:
FROM building_structure
To join the two tables and limit the results to `building_id = 23`:    

[ ]:
JOIN building_structure ON id_map.building_id = building_structure.building_id
In addition to the basic `JOIN` clause, specific join types can be specified, which specify whether the common column needs to be in one, both, or either of the two tables being joined. The different join types are below. The left table is the table specified first, immediately after the `FROM` clause and the right table is the table specified after the `JOIN` clause. If the generic `JOIN` clause is used, then by default the `INNER JOIN` will be used.

| JOIN Type | Definition |
|---|---|
| `INNER JOIN` | Returns rows where the ID is in both tables |
| `LEFT JOIN` | Returns all rows from the left table; returns NA in columns from the right table when the ID is not in the right table |
| `RIGHT JOIN` | Returns all rows from the right table; returns NA in columns from the left table when the ID is not in the left table |
| `FULL JOIN` | Returns rows where the ID is in either table; returns NA in columns from whichever table is missing the ID |
The video below outlines the main types of joins:

[ ]:
YouTubeVideo("2HVMiPPuPIM")
<font size="+1">Practice</font>

Try it yourself! Use the DISTINCT command to create a column with all unique building IDs in the id_map table. LEFT JOIN this column with the roof_type column from the building_structure table, showing only buildings where district_id is 1 and limiting your results to the first five rows of the new table.

[ ]:
%%sql
# Using pandas with SQL Databases

To save the output of a query into a pandas DataFrame, we will connect to the SQLite database using the `sqlite3` package:

[ ]:
conn = sqlite3.connect("/home/jovyan/nepal.sqlite")
To run a query using `sqlite3`, we need to store the query as a string. For example, the variable below called `query` is a string containing a query which returns the first 10 rows from the `id_map` table:

[ ]:
    FROM id_map
To save the results of the query into a pandas DataFrame, use the `pd.read_sql()` function. The optional parameter `index_col` can be used to set the index to a specific column from the query. 

[ ]:
df = pd.read_sql(query, conn, index_col="building_id")
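A self-contained sketch of the same call against an in-memory database with made-up rows:

```python
import sqlite3

import pandas as pd

# An in-memory stand-in for nepal.sqlite.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE id_map (building_id INTEGER, district_id INTEGER)")
conn.executemany("INSERT INTO id_map VALUES (?, ?)", [(23, 2), (24, 2)])
conn.commit()

# Read the query results straight into a DataFrame, indexed by building_id.
df = pd.read_sql("SELECT * FROM id_map", conn, index_col="building_id")
```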
<font size="+1">Practice</font>

Try it yourself! Use the pd.read_sql function to save the results of a query to a DataFrame. The query should select the first 20 rows from the id_map table.

[ ]:
query = ...
# References & Further Reading

  • Additional Explanation of Magic Commands
  • ipython-SQL User Documentation
  • Data Carpentry Course on SQL in Python
  • SQL Course Material on GitHub (1)
  • SQL Course Material on GitHub (2)
<font size="+3"><strong>Machine Learning: Data Pre-Processing and Production</strong></font>

[1]:
import warnings

warnings.simplefilter(action="ignore", category=FutureWarning)
# What's scikit-learn?

scikit-learn is a Python library that contains implementations of many common machine learning algorithms and exposes a common interface for them, which makes experimentation easy. In this section, we'll look at linear regression (which you'll use to predict price based on the area of a property) and k-nearest neighbors (which you'll use to classify the neighborhood a property is in).

# Data Preprocessing

## Standardization

**Standardization** is a widely used scaling technique to transform features before fitting a model. Feature scaling changes all of a dataset's continuous features so that they share a more consistent range of values. Specifically, we subtract the mean from each data point and then divide by the standard deviation:

$$\hat{X} = \frac{X - \mu}{\sigma}$$

The goal of standardization is to improve model performance by having all continuous features on the same scale. It's useful in at least two circumstances:

  1. For machine learning algorithms that use Euclidean distance (k-means and k-nearest neighbors), different scales can distort the calculation of distance and hurt model performance.
  2. For dimensionality reduction (principal component analysis), it can improve the model's ability to find combinations of features that have the most variance.
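The formula can be checked by hand on a small made-up sample; this sketch mirrors what `StandardScaler` computes, using the population standard deviation:

```python
# A made-up sample of surface areas in square meters.
values = [50.0, 55.0, 80.0, 90.0, 280.0]

# Mean and (population) standard deviation.
mu = sum(values) / len(values)
sigma = (sum((v - mu) ** 2 for v in values) / len(values)) ** 0.5

# Apply X_hat = (X - mu) / sigma to every observation.
standardized = [(v - mu) / sigma for v in values]
```

After the transformation, the sample has mean 0 and standard deviation 1.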
Let's check the following example where we apply standardization on one of the columns in the following DataFrame:

[2]:
df = pd.read_csv("./data/mexico-city-test-features.csv").dropna()
[2]:
surface_covered_in_m2 lat lon neighborhood
0 90.0 19.367931 -99.170262 Benito Juárez
1 50.0 19.363542 -99.224084 Álvaro Obregón
2 280.0 19.457982 -99.192690 Miguel Hidalgo
3 55.0 19.334270 -99.083374 Iztapalapa
4 80.0 19.416881 -99.109781 Venustiano Carranza
Our target feature is the `"surface_covered_in_m2"` column. Let's first check the maximum and minimum of this column before standardization:

[3]:
print("Maximum before standardization is:", df["surface_covered_in_m2"].max())
Maximum before standardization is: 280.0
Minimum before standardization is: 50.0
We can perform the transformation by first instantiating the scaler and assigning the feature to a variable name. Then we fit the scaler and transform the data:

[4]:
from sklearn.preprocessing import StandardScaler
[5]:
# Fit the scaler to feature
[5]:
StandardScaler()
[6]:
# Pass the scaler to feature to transform data
[6]:
array([[ 1.62304525e-01],
       [-1.11902808e+00],
       [ 6.24863439e+00],
       ...,
       [ 2.13794980e-03],
       [-3.18195201e-01],
       [ 2.13794980e-03]])
Now you can see the transformed data range is much smaller after standardization:

[7]:
print("Maximum after standardization is:", X_transformed.max())
Maximum after standardization is: 6.248634385593622
Minimum after standardization is: -1.1190280771385155
We can also combine the fit and transform process together into one step:

[8]:
X_transformed = scaler.fit_transform(X_train)
[8]:
array([[ 1.62304525e-01],
       [-1.11902808e+00],
       [ 6.24863439e+00],
       ...,
       [ 2.13794980e-03],
       [-3.18195201e-01],
       [ 2.13794980e-03]])
<font size="+1">Practice</font>  

Standardize the price column in "mexico-city-real-estate-1.csv":

[9]:
df1 = pd.read_csv("./data/mexico-city-real-estate-1.csv")
[9]:
operation property_type place_with_parent_names lat-lon price currency price_aprox_local_currency price_aprox_usd surface_total_in_m2 surface_covered_in_m2 price_usd_per_m2 price_per_m2 floor rooms expenses properati_url
0 sell apartment |México|Distrito Federal|Álvaro Obregón| NaN 35000000.0 MXN 35634500.02 1894595.53 NaN NaN NaN NaN NaN NaN NaN http://alvaro-obregon.properati.com.mx/2eb_ven...
1 sell apartment |México|Distrito Federal|Benito Juárez| NaN 2000000.0 MXN 2036257.11 108262.60 NaN NaN NaN NaN NaN NaN NaN http://benito-juarez.properati.com.mx/2ec_vent...
2 sell apartment |México|Distrito Federal|Cuauhtémoc| 19.41501,-99.175174 2700000.0 MXN 2748947.10 146154.51 61.0 61.0 2395.975574 44262.295082 NaN 3.0 NaN http://cuauhtemoc.properati.com.mx/2pu_venta_a...
3 sell apartment |México|Distrito Federal|Cuauhtémoc| 19.41501,-99.175174 6347000.0 MXN 6462061.92 343571.36 176.0 128.0 1952.110000 49585.937500 NaN 5.0 NaN http://cuauhtemoc.properati.com.mx/2pv_venta_a...
4 sell apartment |México|Distrito Federal|Álvaro Obregón| NaN 6870000.0 MXN 6994543.16 371882.03 180.0 136.0 2066.011278 50514.705882 NaN 5.0 NaN http://alvaro-obregon.properati.com.mx/2pw_ven...
[10]:
X_transformed = ...
[10]:
Ellipsis
## One-Hot Encoding

A property's district is **categorical data**, or data which can be divided into groups. For many machine learning algorithms, it's common to create a column in a DataFrame to indicate if a feature is present or absent, instead of using the category's name. First, you create a column for each district name; then, for each observation, you put a 1 or a 0 to indicate whether or not the property is located in that neighborhood. Let's take a look at the `mexico-city-test-features.csv` dataset for properties which include the district.

[11]:
df = pd.read_csv("./data/mexico-city-test-features.csv").dropna()
[11]:
surface_covered_in_m2 lat lon neighborhood
0 90.0 19.367931 -99.170262 Benito Juárez
1 50.0 19.363542 -99.224084 Álvaro Obregón
2 280.0 19.457982 -99.192690 Miguel Hidalgo
3 55.0 19.334270 -99.083374 Iztapalapa
4 80.0 19.416881 -99.109781 Venustiano Carranza
You can do one-hot encoding using the pandas [`get_dummies`](https://pandas.pydata.org/docs/reference/api/pandas.get_dummies.html) function, but we'll use the [Category Encoders](https://contrib.scikit-learn.org/category_encoders/) library, since it allows us to integrate the one-hot encoder as a transformer in a scikit-learn Pipeline.

[12]:
from category_encoders import OneHotEncoder
[12]:
surface_covered_in_m2 lat lon neighborhood_Benito Juárez neighborhood_Álvaro Obregón neighborhood_Miguel Hidalgo neighborhood_Iztapalapa neighborhood_Venustiano Carranza neighborhood_Tlalpan neighborhood_Coyoacán neighborhood_La Magdalena Contreras neighborhood_Azcapotzalco neighborhood_Cuauhtémoc neighborhood_Cuajimalpa de Morelos neighborhood_Gustavo A. Madero neighborhood_Tláhuac neighborhood_Iztacalco neighborhood_Xochimilco
0 90.0 19.367931 -99.170262 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1 50.0 19.363542 -99.224084 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0
2 280.0 19.457982 -99.192690 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0
3 55.0 19.334270 -99.083374 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0
4 80.0 19.416881 -99.109781 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0
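The same idea can be sketched with the pandas `get_dummies` function mentioned above, on a made-up neighborhood column:

```python
import pandas as pd

# Three made-up observations in two neighborhoods.
df = pd.DataFrame(
    {"neighborhood": ["Benito Juárez", "Iztapalapa", "Benito Juárez"]}
)

# One indicator column per category; exactly one indicator is set per row.
encoded = pd.get_dummies(df, columns=["neighborhood"])
```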
<font size="+1">Practice</font>  

Create a DataFrame which one-hot encodes the property_type column in mexico-city-real-estate-1.csv. The DataFrame you create should have extra columns for apartments, houses, and stores.

[13]:
    "./data/mexico-city-real-estate-1.csv", usecols=["property_type"]
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In [13], line 6
      4 ohe = ...
      5 mexico_city1_ohe = ...
----> 6 mexico_city1_ohe.head()

AttributeError: 'ellipsis' object has no attribute 'head'
## Ordinal Encoding

For many machine learning algorithms, it's common to use one-hot encoding. This works well if there are a few categories, but as the number of features grows, the number of additional columns also grows. 

Having a large number of columns (and consequently a large number of features in your model) can lead to a number of issues often referred to as the curse of dimensionality. Two primary issues that can arise are computational complexity (operations performed on larger datasets may take longer) and overfitting (the model may not generalize to new data). In these scenarios, ordinal encoding is a popular choice for encoding the categorical variable. Instead of creating new columns, ordinal encoding simply replaces the categories in a categorical variable with integers.

One potential risk of ordinal encoding is that some machine learning algorithms assume the integer values imply an ordering in the variables. This matters in logistic regression, where a relationship is defined between increases or decreases in the features and the target. Techniques like decision trees can safely use ordinal encoding because they generate splits: rather than assuming any ordering between the numeric values, the splits occur between the values and effectively separate them. You can use the `OrdinalEncoder` transformer to perform ordinal encoding:

[ ]:
from category_encoders import OrdinalEncoder
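Conceptually, ordinal encoding just maps each category to an integer. Here is a hedged sketch with plain pandas (not the `OrdinalEncoder` transformer itself) on made-up data:

```python
import pandas as pd

df = pd.DataFrame({"property_type": ["apartment", "house", "store", "apartment"]})

# Assign each category an integer in order of first appearance.
mapping = {cat: i + 1 for i, cat in enumerate(df["property_type"].unique())}
df["property_type_encoded"] = df["property_type"].map(mapping)
```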
<font size="+1">Practice</font>  

Create a DataFrame which ordinal encodes the property_type column in mexico-city-real-estate-1.csv. The DataFrame you create should have integers replacing the values for apartments, houses, and stores.

[ ]:
    "./data/mexico-city-real-estate-1.csv", usecols=["property_type"]
## Imputation

Let's take a look at `mexico-city-real-estate-1.csv` and impute some of the missing values. First, we'll load the dataset, limiting ourselves to the `"surface_covered_in_m2"` and `"price_aprox_usd"` columns.

[ ]:
mexico_city1 = pd.read_csv("./data/mexico-city-real-estate-1.csv", usecols=columns)
When you need to build a model using features that contain missing values, one helpful tool is the scikit-learn transformer [`SimpleImputer`](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html). In order to use it, we need to start by instantiating the transformer. 

[ ]:
from sklearn.impute import SimpleImputer
Next, we train the imputer using the data. At this step it will calculate the mean value for each column.

[ ]:
imputer.fit(mexico_city1)
Last, we transform the data using the imputer.

[ ]:
mexico_city1_imputed = imputer.transform(mexico_city1)
Since the imputer doesn't return a DataFrame, let's transform it into one. 

[ ]:
mexico_city1_imputed = pd.DataFrame(mexico_city1_imputed, columns=columns)
Now there are no missing values!

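The whole flow can be sketched end-to-end with plain pandas: filling each column with its mean is equivalent to what `SimpleImputer(strategy="mean")` does (the rows below are made up):

```python
import pandas as pd

# Made-up data with one gap in each column.
df = pd.DataFrame(
    {
        "surface_covered_in_m2": [50.0, None, 80.0],
        "price_aprox_usd": [None, 100_000.0, 200_000.0],
    }
)

# Replace each missing value with its column's mean.
imputed = df.fillna(df.mean())
```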
Then we use the imputer to transform the data.

<font size="+1">Practice</font>

Read mexico-city-real-estate-1.csv into a DataFrame and impute the missing values for "surface_covered_in_m2" and "price_aprox_usd".

[ ]:
mexico_city2_imputed = pd.DataFrame(mexico_city2_imputed, columns=columns)
## Data Leakage

Let's consider the `mexico-city-real-estate-1.csv` dataset and fit a regression model using `surface_covered_in_m2` and `price_aprox_local_currency` to estimate `price_aprox_usd`.

[ ]:
mexico_city1 = pd.read_csv("./data/mexico-city-real-estate-1.csv", usecols=columns)
Now let's calculate the mean absolute error in our training data.

[ ]:
    mexico_city1[["surface_covered_in_m2", "price_aprox_local_currency"]]
When you see a mean absolute error that's this close to zero (especially when the mean apartment price is so much larger), chances are there is leakage in your model! Here the likely culprit is `price_aprox_local_currency`, which is essentially the target price expressed in another currency.

# Imbalanced Data

When dealing with classification problems, we would ideally like the training data to be evenly spread across the different classes. When the numbers of observations in different classes are uneven, we have **imbalanced data**. The class that represents the majority of observations is called the **majority class**, while the class with fewer observations is called the **minority class**. Imbalanced data limits the training data available for certain classes. In addition, when one class makes up most of the data, a model can achieve high accuracy simply by always predicting the majority class. Thus, prior to training a model, it is essential either to balance the data, by under-sampling the majority class or over-sampling the minority class, or to use evaluation metrics better suited to imbalance, such as **recall** or **precision**.

## Under-sampling

When data is imbalanced across classes, one way to balance it is to reduce the number of observations in the majority class. This is called **under-sampling**. We can under-sample by randomly deleting some observations from the majority class. [imbalanced-learn](https://imbalanced-learn.org/stable/) (imported as `imblearn`) is an open-source library that works with `scikit-learn` and provides tools for dealing with imbalanced classes. Here's an example of randomly deleting observations from the majority class using Poland bankruptcy data from 2008.

[ ]:
with gzip.open("data/poland-bankruptcy-data-2008.json.gz", "r") as f:
The data is clearly imbalanced: there are many more non-bankruptcy observations than bankruptcy observations.

[ ]:
X, y = RandomUnderSampler().fit_resample(df[["company_id"]], df[["bankrupt"]])
Now we have reduced the non-bankruptcy class to the same size as the bankruptcy class.

## Over-sampling

**Over-sampling** is the opposite of under-sampling. Instead of reducing the majority class, over-sampling increases the number of observations in the minority class by randomly making copies of the existing observations. Here is an example of making random copies from the minority class using the Poland bankruptcy data and `imblearn`.

[ ]:
X, y = RandomOverSampler().fit_resample(df[["company_id"]], df[["bankrupt"]])
Now we have increased the bankruptcy class to the size of the non-bankruptcy class.

### Practice

Now that you've seen an example of imbalanced data and how to under- or over-sample it prior to model training, let's get some practice with the Poland bankruptcy data from 2007.

[ ]:
with gzip.open("data/poland-bankruptcy-data-2007.json.gz", "r") as f:
First, check whether this data is imbalanced.

[ ]:
Next, do under-sampling.

[ ]:
X, y = ...
Finally, check whether the data is balanced.

[ ]:
Great work! Now try over-sampling.

[ ]:
X, y = ...
And check whether the data is balanced.

[ ]:
# scikit-learn in Production

The previous examples have built models and made predictions one step at a time. Many machine learning applications will require you to run the same steps many times, usually with new or updated data. scikit-learn allows you to define a set of steps to process data for machine learning in a reproducible manner using a pipeline.

## Creating a Pipeline in scikit-learn

First, we create a pipeline to do linear regression on the transformed data set.

[ ]:
lin_reg = linear_model.LinearRegression()
pipe = Pipeline(steps=[("regressor", lin_reg)])
We can check the steps in the pipeline, but right now, there's only one.

[ ]:
pipe.named_steps
Then we fit a linear regression model to our data.

[ ]:
mexico_city1["surface_covered_in_m2"] = mexico_city1["surface_covered_in_m2"].astype(float)
[ ]:
print(y_pred.head())
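The pipeline workflow above can be sketched end to end on synthetic data; the step names `"imputer"` and `"regressor"` are labels we choose ourselves, and all values are invented:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(0)
X = pd.DataFrame({"surface_covered_in_m2": rng.uniform(30, 300, 100)})
y = X["surface_covered_in_m2"] * 2_000 + rng.normal(0, 1_000, 100)
X.iloc[::10, 0] = np.nan  # knock out some values so the imputer has work to do

pipe = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")),  # fill missing values first
    ("regressor", LinearRegression()),            # then fit the model
])
pipe.fit(X, y)

y_pred = pd.Series(pipe.predict(X))
print(y_pred.head())
```

Because the imputer and the regressor are bundled into one object, `pipe.fit` and `pipe.predict` always apply the same preprocessing, which is exactly what you want in production.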
<font size="+1">Practice</font>

Try this on the price_aprox_usd column in the mexico-city-real-estate-1.csv dataset.


[ ]:
print(y_pred.head())
Let's use the `make_pipeline` function to create a pipeline to fit a linear regression model for the `mexico-city-real-estate-1.csv` dataset.

[ ]:
X = mexico_city1.surface_covered_in_m2.values.reshape(-1, 1)
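`make_pipeline` builds the same kind of object as `Pipeline`, but names each step automatically after its lowercased class name. A minimal sketch on noise-free invented data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X = np.arange(1, 11, dtype=float).reshape(-1, 1)  # 10 observations, 1 feature
y = 3.0 * X.ravel() + 2.0                         # exact line: y = 3x + 2

pipe = make_pipeline(StandardScaler(), LinearRegression())
pipe.fit(X, y)

print(pipe.named_steps.keys())  # 'standardscaler', 'linearregression'
print(pipe.predict([[12.0]]))   # close to 38.0, since y = 3(12) + 2
```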
Let's try to predict `price_aprox_usd` in the `mexico-city-test-features.csv` dataset.

[ ]:
mexico_city_features = pd.read_csv("./data/mexico-city-test-features.csv")
## Accessing an Object in a Pipeline

Let's figure out the regression coefficients.

[ ]:
pipe.named_steps["regressor"].coef_
<font size="+1">Practice</font>

Now obtain the intercept.

[ ]:
pipe.named_steps[...].intercept_
# References & Further Reading

  • One-Hot Encoding with the Category Encoder Package
  • Example of Using One-Hot Encoding
  • Online Example of Using One-Hot Encoding
  • Official pandas Documentation on Get Dummies
  • Online Tutorial on Pipelines for Linear Regression
  • scikit-learn Pipeline Documentation
  • Wikipedia article on the curse of dimensionality
  • Wikipedia Article on Leakage in Machine Learning
  • Official Pandas Documentation on Missing Data
  • Wikipedia Article on Imputation
  • Online Tutorial on Removing Rows with Missing Data
  • scikit-learn Documentation on SimpleImputer
  • imbalanced-learn Documentation
---

Copyright 2022 WorldQuant University. This content is licensed solely for personal use. Redistribution or publication of this material is strictly prohibited.

Usage Guidelines

This lesson is part of the DS Lab core curriculum. For that reason, this notebook can only be used on your WQU virtual machine.

This means:

  • ⓧ No downloading this notebook.
  • ⓧ No re-sharing of this notebook with friends or colleagues.
  • ⓧ No downloading the embedded videos in this notebook.
  • ⓧ No re-sharing embedded videos with friends or colleagues.
  • ⓧ No adding this notebook to public or private repositories.
  • ⓧ No uploading this notebook (or screenshots of it) to other websites, including websites for study resources.

<font size="+3"><strong>Visualizing Data: seaborn</strong></font>

There are many ways to interact with data, and one of the most powerful modes of interaction is through **visualizations**. Visualizations show data graphically, and are useful for exploring, analyzing, and presenting datasets. We use four libraries for making visualizations: [pandas](../%40textbook/07-visualization-pandas.ipynb), [Matplotlib](../%40textbook/06-visualization-matplotlib.ipynb), [plotly express](../%40textbook/08-visualization-plotly.ipynb), and seaborn. In this section, we'll focus on using seaborn.

# Scatter Plots

A **scatter plot** is a graph that uses dots to represent values for two different numeric variables. The position of each dot on the horizontal and vertical axis indicates values for an individual data point. Scatter plots are used to observe relationships between variables, and are especially useful if you're looking for **correlations**.

In the following example, we will see some scatter plots based on the Mexico City real estate data. Specifically, we can use a scatter plot to show how `"price"` and `"surface_covered_in_m2"` are correlated. First, we need to read the dataset and do a little cleaning.
[1]:
mexico_city1 = pd.read_csv("./data/mexico-city-real-estate-1.csv")
[1]:
operation property_type place_with_parent_names lat-lon price currency price_aprox_local_currency price_aprox_usd surface_total_in_m2 surface_covered_in_m2 price_per_m2 properati_url
2 sell apartment |México|Distrito Federal|Cuauhtémoc| 19.41501,-99.175174 2700000.0 MXN 2748947.10 146154.51 61.0 61.0 44262.295082 http://cuauhtemoc.properati.com.mx/2pu_venta_a...
3 sell apartment |México|Distrito Federal|Cuauhtémoc| 19.41501,-99.175174 6347000.0 MXN 6462061.92 343571.36 176.0 128.0 49585.937500 http://cuauhtemoc.properati.com.mx/2pv_venta_a...
6 sell apartment |México|Distrito Federal|Miguel Hidalgo| 19.456564,-99.191724 670000.0 MXN 682146.11 36267.97 65.0 65.0 10307.692308 http://miguel-hidalgo-df.properati.com.mx/46h_...
7 sell apartment |México|Distrito Federal|Gustavo A. Madero| 19.512787,-99.141393 1400000.0 MXN 1425379.97 75783.82 82.0 70.0 20000.000000 http://gustavo-a-madero.properati.com.mx/46p_v...
8 sell house |México|Distrito Federal|Álvaro Obregón| 19.358776,-99.213557 6680000.0 MXN 6801098.67 361597.08 346.0 346.0 19306.358382 http://alvaro-obregon.properati.com.mx/46t_ven...
Use seaborn to plot the scatter plot for `"price"` and `"surface_covered_in_m2"`:

[2]:
sns.scatterplot(data=mexico_city1, x="price", y="surface_covered_in_m2");
There is a very useful argument in `scatterplot` called `hue`. By specifying a categorical column as `hue`, seaborn can create a scatter plot between two variables in different categories with different colors. Let's check the following example using `"property_type"`:

[3]:
sns.scatterplot(data=mexico_city1, x="price", y="surface_covered_in_m2", hue="property_type");
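If you don't have the CSV at hand, the same call works on synthetic data (all values invented; the `Agg` backend just renders off-screen when you're not in a notebook):

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; unnecessary inside a notebook
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(1)
df = pd.DataFrame({
    "price": rng.uniform(5e5, 7e6, 120),
    "surface_covered_in_m2": rng.uniform(30, 350, 120),
    "property_type": rng.choice(["apartment", "house", "store"], 120),
})

ax = sns.scatterplot(
    data=df, x="price", y="surface_covered_in_m2", hue="property_type"
)
ax.figure.savefig("scatter.png")  # one color per property type, with a legend
```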
<font size="+1">Practice</font>

Plot a scatter plot for "price" and "surface_total_in_m2" by "property_type" for "mexico-city-real-estate-1.csv":

[ ]:
# Bar Charts

A **bar chart** is a graph that shows all the values of a categorical variable in a dataset. Bar charts consist of an axis and a series of labeled horizontal or vertical bars. The bars depict frequencies of different values of a variable, or simply the different values themselves. The numbers on the y-axis of a vertical bar chart or the x-axis of a horizontal bar chart are called the scale.

In the following example, we will see some bar plots based on the Mexico City real estate dataset. Specifically, we will count the number of observations in each borough and plot them. We first need to import the dataset and extract the borough and other location information from the column "place_with_parent_names".

[4]:
] = mexico_city1["place_with_parent_names"].str.split("|", 4, expand=True)
[4]:
operation property_type place_with_parent_names lat-lon price currency price_aprox_local_currency price_aprox_usd surface_total_in_m2 surface_covered_in_m2 price_per_m2 properati_url Country City Borough
2 sell apartment |México|Distrito Federal|Cuauhtémoc| 19.41501,-99.175174 2700000.0 MXN 2748947.10 146154.51 61.0 61.0 44262.295082 http://cuauhtemoc.properati.com.mx/2pu_venta_a... México Distrito Federal Cuauhtémoc
3 sell apartment |México|Distrito Federal|Cuauhtémoc| 19.41501,-99.175174 6347000.0 MXN 6462061.92 343571.36 176.0 128.0 49585.937500 http://cuauhtemoc.properati.com.mx/2pv_venta_a... México Distrito Federal Cuauhtémoc
6 sell apartment |México|Distrito Federal|Miguel Hidalgo| 19.456564,-99.191724 670000.0 MXN 682146.11 36267.97 65.0 65.0 10307.692308 http://miguel-hidalgo-df.properati.com.mx/46h_... México Distrito Federal Miguel Hidalgo
7 sell apartment |México|Distrito Federal|Gustavo A. Madero| 19.512787,-99.141393 1400000.0 MXN 1425379.97 75783.82 82.0 70.0 20000.000000 http://gustavo-a-madero.properati.com.mx/46p_v... México Distrito Federal Gustavo A. Madero
8 sell house |México|Distrito Federal|Álvaro Obregón| 19.358776,-99.213557 6680000.0 MXN 6801098.67 361597.08 346.0 346.0 19306.358382 http://alvaro-obregon.properati.com.mx/46t_ven... México Distrito Federal Álvaro Obregón
Let's check the example of a bar plot showing the value counts of each borough in the dataset. We first need to create a DataFrame showing the value counts:

[5]:
bar_df = pd.DataFrame(mexico_city1["Borough"].value_counts()).reset_index()
[5]:
index Borough
0 Miguel Hidalgo 345
1 Cuajimalpa de Morelos 255
2 Álvaro Obregón 203
3 Benito Juárez 198
4 Tlalpan 171
5 Iztapalapa 134
6 Tláhuac 125
7 Cuauhtémoc 120
8 Gustavo A. Madero 89
9 Venustiano Carranza 81
10 Coyoacán 80
11 La Magdalena Contreras 41
12 Xochimilco 34
13 Iztacalco 27
14 Azcapotzalco 24
15 Milpa Alta 1
Since there are 16 different categories in Borough, we should increase the default plot size and rotate the x axis to make the plot more readable using the following syntax:

[6]:
ax = sns.barplot(data=bar_df, x="index", y="Borough")
[6]:
[Text(0, 0, 'Miguel Hidalgo'),
 Text(1, 0, 'Cuajimalpa de Morelos'),
 Text(2, 0, 'Álvaro Obregón'),
 Text(3, 0, 'Benito Juárez'),
 Text(4, 0, 'Tlalpan'),
 Text(5, 0, 'Iztapalapa'),
 Text(6, 0, 'Tláhuac'),
 Text(7, 0, 'Cuauhtémoc'),
 Text(8, 0, 'Gustavo A. Madero'),
 Text(9, 0, 'Venustiano Carranza'),
 Text(10, 0, 'Coyoacán'),
 Text(11, 0, 'La Magdalena Contreras'),
 Text(12, 0, 'Xochimilco'),
 Text(13, 0, 'Iztacalco'),
 Text(14, 0, 'Azcapotzalco'),
 Text(15, 0, 'Milpa Alta')]
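The figure-size and tick-rotation recipe can be sketched end to end with a few of the counts from the output above (only three boroughs kept, to stay short):

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; unnecessary inside a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

bar_df = pd.DataFrame({
    "index": ["Miguel Hidalgo", "Cuajimalpa de Morelos", "Álvaro Obregón"],
    "Borough": [345, 255, 203],  # value counts, as in the lesson's output
})

plt.figure(figsize=(12, 6))  # a wider canvas for many categories
ax = sns.barplot(data=bar_df, x="index", y="Borough")
ax.set_xticklabels(ax.get_xticklabels(), rotation=45)  # rotate for readability
plt.tight_layout()
```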
<font size="+1">Practice</font>

Plot a bar plot showing the value counts for property types in "mexico-city-real-estate-1.csv":


[7]:
pro_typ_df = pd.DataFrame(mexico_city1["property_type"].value_counts()).reset_index()
[7]:
<AxesSubplot:xlabel='index', ylabel='property_type'>
# Correlation Heatmaps

A correlation heatmap shows the relative strength of correlations between the variables in a dataset. Here's what the code looks like:

[8]:
mexico_city1_numeric = mexico_city1.select_dtypes(include="number")
[8]:
<AxesSubplot:>
Notice that we dropped the columns and rows with missing entries before plotting the graph.

This heatmap is showing us what we might already have suspected: the price is moderately positively correlated with the size of the properties.

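A self-contained version of the heatmap, with synthetic numeric columns (all values invented, with `price` built from `surface_covered_in_m2` so the two are visibly correlated):

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; unnecessary inside a notebook
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(0)
surface = rng.uniform(30, 300, 100)
df = pd.DataFrame({
    "surface_covered_in_m2": surface,
    "price": surface * 2_000 + rng.normal(0, 50_000, 100),
    "rooms": rng.integers(1, 6, 100).astype(float),
})

corr = df.dropna().corr()  # pairwise correlation matrix of the numeric columns
ax = sns.heatmap(corr)     # one colored cell per correlation
```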
<font size="+1">Practice</font>

The seaborn documentation on heat maps indicates how to add numeric labels to each cell and how to use a different colormap. Modify the plot to use the viridis colormap, have a linewidth of 0.5 between each cell, and have numeric labels for each cell.

[ ]:
# References and Further Reading

  • Official Plotly Express Documentation on Scatter Plots
  • Official Plotly Express Documentation on 3D Plots
  • Official Plotly Documentation on Notebooks
  • Plotly Community Forum Post on Axis Labeling
  • Plotly Express Official Documentation on Tile Maps
  • Plotly Express Official Documentation on Figure Display
  • Online Tutorial on String Conversion in Pandas
  • Official Pandas Documentation on using Lambda Functions on a Column
  • Official seaborn Documentation on Generating a Heatmap
  • Online Tutorial on Correlation Matrices in Pandas
  • Official Pandas Documentation on Correlation Matrices
  • Official Matplotlib Documentation on Colormaps
  • Official Pandas Documentation on Box Plots
  • Online Tutorial on Box Plots
  • Online Tutorial on Axes Labels in seaborn and Matplotlib
  • Matplotlib Gallery Example of an Annotated Heatmap
---

<font size="+3"><strong>Visualizing Data: plotly express</strong></font>

There are many ways to interact with data, and one of the most powerful modes of interaction is through **visualizations**. Visualizations show data graphically, and are useful for exploring, analyzing, and presenting datasets. We use four libraries for making visualizations: [pandas](../%40textbook/07-visualization-pandas.ipynb), [Matplotlib](../%40textbook/06-visualization-matplotlib.ipynb), plotly express, and [seaborn](../%40textbook/09-visualization-seaborn.ipynb). In this section, we'll focus on using plotly express.

# Scatter Plots

A **scatter plot** is a graph that uses dots to represent values for two different numeric variables. The position of each dot on the horizontal and vertical axis indicates values for an individual data point. Scatter plots are used to observe relationships between variables, and are especially useful if you're looking for **correlations**.

[1]:
mexico_city1 = pd.read_csv("./data/mexico-city-real-estate-1.csv")
[1]:
operation property_type place_with_parent_names lat-lon price currency price_aprox_local_currency price_aprox_usd surface_total_in_m2 surface_covered_in_m2 price_per_m2 properati_url
2 sell apartment |México|Distrito Federal|Cuauhtémoc| 19.41501,-99.175174 2700000.0 MXN 2748947.10 146154.51 61.0 61.0 44262.295082 http://cuauhtemoc.properati.com.mx/2pu_venta_a...
3 sell apartment |México|Distrito Federal|Cuauhtémoc| 19.41501,-99.175174 6347000.0 MXN 6462061.92 343571.36 176.0 128.0 49585.937500 http://cuauhtemoc.properati.com.mx/2pv_venta_a...
6 sell apartment |México|Distrito Federal|Miguel Hidalgo| 19.456564,-99.191724 670000.0 MXN 682146.11 36267.97 65.0 65.0 10307.692308 http://miguel-hidalgo-df.properati.com.mx/46h_...
7 sell apartment |México|Distrito Federal|Gustavo A. Madero| 19.512787,-99.141393 1400000.0 MXN 1425379.97 75783.82 82.0 70.0 20000.000000 http://gustavo-a-madero.properati.com.mx/46p_v...
8 sell house |México|Distrito Federal|Álvaro Obregón| 19.358776,-99.213557 6680000.0 MXN 6801098.67 361597.08 346.0 346.0 19306.358382 http://alvaro-obregon.properati.com.mx/46t_ven...
After cleaning the data, we can use plotly express to draw scatter plots by specifying the DataFrame and the column names of interest.

[2]:
fig = px.scatter(mexico_city1, x="price", y="surface_covered_in_m2")
<font size="+1">Practice</font>

Plot the scatter plot for column "price" and "surface_total_in_m2".


[3]:
fig = px.scatter(mexico_city1, x="price", y="surface_total_in_m2")
# 3D Scatter Plots

Scatter plots can summarize information in a DataFrame. Three-dimensional scatter plots look great, but be careful: it can be difficult for people who might not be sure what they're looking at to accurately determine the values of points in the plot. Still, 3D scatter plots are useful for displaying relationships between three quantities that would be more difficult to observe in a two-dimensional plot.

Let's take a look at the first 50 rows of the mexico-city-real-estate-1.csv dataset.

[4]:
] = mexico_city1["place_with_parent_names"].str.split("|", 4, expand=True)
Notice that the plot is interactive: you can rotate it and zoom in or out. These kinds of plots also make outliers easier to find; here, we can see that houses have higher prices than other types of properties.

<font size="+1">Practice</font>

Modify the DataFrame to include columns for the base 10 log of price and surface_covered_in_m2 and then plot these for the entire mexico-city-real-estate-1.csv dataset.


[5]:
import math
# Mapbox Scatter Plots

A **mapbox scatter plot** is a special kind of scatter plot that allows you to create scatter plots in two dimensions and then superimpose them on top of a map. Our `mexico-city-real-estate-1.csv` dataset is a good place to start, because it includes **location data**. After importing the dataset and removing rows with missing data, split the `lat-lon` column into two separate columns: one for `latitude` and the other for `longitude`. Then use these to make a mapbox plot. Unfortunately, at present this type of plot does not easily allow for marker shape to vary based on a column of the DataFrame.

[6]:
mexico_city1[["latitude", "longitude"]] = mexico_city1["lat-lon"].str.split(
<font size="+1">Practice</font>

Create another column in the DataFrame with a log scale of the prices. Then create three separate plots, one for stores, another for houses, and a final one for apartments. Color the points in the plots by the log of the price.


[7]:
from math import log10
# Choropleth Maps

A **choropleth map** is a map composed of colored polygons, showing the variable of interest at different color depths across geographies. Plotly express has a function called `px.choropleth` that can be used to plot choropleth maps. The challenge here is getting the geometry information. There are two ways: one is to use plotly's built-in geometries when plotting US states (use the state abbreviation directly) or world countries (use the ISO-3 code). The other is to look for GeoJSON files in which each location has geometry information. In the following example, we will plot US states with a synthetic dataset.

[8]:
    {"State": ["CA", "TX", "NY", "HI", "DE"], "Temparature": [100, 120, 110, 90, 105]}
[8]:
State Temparature
0 CA 100
1 TX 120
2 NY 110
3 HI 90
4 DE 105
[9]:
    df, locations="State", locationmode="USA-states", color="Temparature", scope="usa"
# Histogram

A **histogram** is a graph that shows the frequency distribution of numerical data. In addition to helping us understand frequency, histograms are also useful for detecting outliers. We can use the `px.histogram()` function from Plotly to draw histograms for specific columns, as long as the data type is numerical. Let's check the following example:

[10]:
df = pd.read_csv("data/mexico-city-real-estate-1.csv")
<font size="+1">Practice</font>

Check the "surface_covered_in_m2" histogram.


[11]:
fig = px.histogram(df, x="surface_covered_in_m2")
# Boxplots

A **boxplot** is a graph that shows the minimum, first quartile, median, third quartile, and maximum values in a dataset. Boxplots are useful because they provide a visual summary of the data, enabling researchers to quickly identify the median, the dispersion of the dataset, and signs of skewness. In the following example, we will explore how to draw boxplots for specific columns of a DataFrame.

[12]:
mexico_city1 = pd.read_csv("./data/mexico-city-real-estate-1.csv")
[12]:
operation property_type place_with_parent_names lat-lon price currency price_aprox_local_currency price_aprox_usd surface_total_in_m2 surface_covered_in_m2 price_per_m2 properati_url
2 sell apartment |México|Distrito Federal|Cuauhtémoc| 19.41501,-99.175174 2700000.0 MXN 2748947.10 146154.51 61.0 61.0 44262.295082 http://cuauhtemoc.properati.com.mx/2pu_venta_a...
3 sell apartment |México|Distrito Federal|Cuauhtémoc| 19.41501,-99.175174 6347000.0 MXN 6462061.92 343571.36 176.0 128.0 49585.937500 http://cuauhtemoc.properati.com.mx/2pv_venta_a...
6 sell apartment |México|Distrito Federal|Miguel Hidalgo| 19.456564,-99.191724 670000.0 MXN 682146.11 36267.97 65.0 65.0 10307.692308 http://miguel-hidalgo-df.properati.com.mx/46h_...
7 sell apartment |México|Distrito Federal|Gustavo A. Madero| 19.512787,-99.141393 1400000.0 MXN 1425379.97 75783.82 82.0 70.0 20000.000000 http://gustavo-a-madero.properati.com.mx/46p_v...
8 sell house |México|Distrito Federal|Álvaro Obregón| 19.358776,-99.213557 6680000.0 MXN 6801098.67 361597.08 346.0 346.0 19306.358382 http://alvaro-obregon.properati.com.mx/46t_ven...
Check the boxplot for column `"price"`:

[13]:
fig = px.box(mexico_city1, y="price")
If you want to check the distribution of a column value by different categories, defined by another categorical column, you can add an `x` argument to specify the name of the categorical column. In the following example, we check the price distribution across different property types:

[14]:
fig = px.box(mexico_city1, x="property_type", y="price")
<font size="+1">Practice</font> 

Check the "surface_covered_in_m2" distribution by property types.

[15]:
fig = ...
fig.show()
# Bar Chart

A **bar chart** is a graph that shows all the values of a categorical variable in a dataset. It consists of an axis and a series of labeled horizontal or vertical bars. The bars depict frequencies of different values of a variable, or simply the different values themselves. The numbers on the y-axis of a vertical bar chart or the x-axis of a horizontal bar chart are called the scale.

In the following example, we will build some bar charts based on the Mexico City real estate dataset. Specifically, we will count the number of observations in each borough and plot them. We first need to read the dataset and extract the borough and other location information from the `"place_with_parent_names"` column.

[ ]:
# Column names other than "Borough" are assumed here; the leading and
# trailing "|" produce empty strings in the first and last columns
mexico_city1[["empty", "country", "state", "Borough", "rest"]] = mexico_city1[
    "place_with_parent_names"
].str.split("|", n=4, expand=True)
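To see what the split produces, here is a quick sketch applied to a single sample value from the `"place_with_parent_names"` column:

```python
import pandas as pd

# One value from the "place_with_parent_names" column shown above
s = pd.Series(["|México|Distrito Federal|Cuauhtémoc|"])

# Splitting on "|" with n=4 yields five columns; the leading and trailing
# "|" produce empty strings in the first and last columns
parts = s.str.split("|", n=4, expand=True)
print(parts)
```

Column 3 holds the borough name, which is why that column can be saved as `"Borough"`.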
We can count the number of real estate listings in the dataset by borough using `value_counts()`, then plot the result as a bar chart:

[ ]:
mexico_city1["Borough"].value_counts()
[ ]:
fig = px.bar(mexico_city1["Borough"].value_counts())
We can create more expressive bar charts by adding more arguments. For example, we can plot the number of observations by borough and property type. First, we need to use `groupby` to calculate the aggregated counts for each borough and property type combination:

[ ]:
size_df = mexico_city1.groupby(["Borough", "property_type"], as_index=False).size()
By specifying `x`, `y` and `color`, the following bar graph shows the total counts by borough, with different property types shown in different colors. Note that `y` has to be numerical, while `x` and `color` are usually categorical variables.

[ ]:
fig = px.bar(size_df, x="Borough", y="size", color="property_type", barmode="relative")
Note the argument `barmode` is specified as 'relative', which is also the default value. In this mode, bars are stacked above each other. We can also use 'overlay' where bars are drawn on top of each other.

[ ]:
fig = px.bar(size_df, x="Borough", y="size", color="property_type", barmode="overlay")
If we want bars to be placed beside each other, we can specify `barmode` as "group":

[ ]:
fig = px.bar(size_df, x="Borough", y="size", color="property_type", barmode="group")
<font size="+1">Practice</font> 

Plot a bar chart of the number of observations by property type in "mexico-city-real-estate-1.csv".

[ ]:
bar_df = ...
# References and Further Reading

  • Official plotly express Documentation on Scatter Plots
  • Official plotly Express Documentation on 3D Plots
  • Official plotly Documentation on Notebooks
  • plotly Community Forum Post on Axis Labeling
  • plotly express Official Documentation on Tile Maps
  • plotly Choropleth Maps in Python Document
  • plotly express Official Documentation on Figure Display
  • Online Tutorial on String Conversion in Pandas
  • Official Pandas Documentation on using Lambda Functions on a Column
  • Official Seaborn Documentation on Generating a Heatmap
  • Online Tutorial on Correlation Matrices in Pandas
  • Official Pandas Documentation on Correlation Matrices
  • Official Matplotlib Documentation on Colormaps
  • Official Pandas Documentation on Box Plots
  • Online Tutorial on Box Plots
  • Online Tutorial on Axes Labels in Seaborn and Matplotlib
  • Matplotlib Gallery Example of an Annotated Heatmap
---

Copyright 2022 WorldQuant University. This content is licensed solely for personal use. Redistribution or publication of this material is strictly prohibited.


Usage Guidelines

This lesson is part of the DS Lab core curriculum. For that reason, this notebook can only be used on your WQU virtual machine.

This means:

  • ⓧ No downloading this notebook.
  • ⓧ No re-sharing of this notebook with friends or colleagues.
  • ⓧ No downloading the embedded videos in this notebook.
  • ⓧ No re-sharing embedded videos with friends or colleagues.
  • ⓧ No adding this notebook to public or private repositories.
  • ⓧ No uploading this notebook (or screenshots of it) to other websites, including websites for study resources.

<font size="+3"><strong>Machine Learning: Classification</strong></font>

[ ]:
warnings.simplefilter(action="ignore", category=FutureWarning)
# Data Preprocessing

For the examples here, we'll look at buildings in the Ramechhap district of Nepal. (In our SQLite database, Ramechhap has the `district_id` of `1`.) Run the `wrangle` function below to connect to the SQLite database and load the data into the DataFrame `df`.

[ ]:
    df["damage_grade"] = pd.to_numeric(df["damage_grade"], errors="coerce")
[ ]:
df.head()
# Data Segregation

## Training Sets

### Randomized Train-Test split

**Splitting a dataset** into different sets is an important part of the model development process. The initial dataset is typically split into **two** (**training** and **testing**) or **three** (**training**, **validation**, and **testing**) datasets. This helps ensure that the model can generalize. Usually, more data is used for training than for validation or testing. If splitting into two datasets, a good rule of thumb is to split your data randomly into a ratio of **80:20** training:testing. If splitting into three datasets, splitting the data into a ratio of **70:20:10** (training:validation:testing) is commonly used. 

Validation datasets are usually used to tune model hyperparameters. A hyperparameter is a model setting that can't be learned during model training and must be explicitly set. In contrast, a model parameter can be learned. An example of a hyperparameter is the depth of a decision tree. An example of a model parameter is a coefficient of a variable in a linear regression.

In order to split our data, we'll be using the train_test_split function from scikit-learn. We'll begin by splitting our data into a training and testing set. Next, we'll apply the train_test_split function to our testing set to generate our validation dataset and new testing dataset.
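The two-step splitting described above can be sketched with toy arrays (the array sizes and `random_state` here are arbitrary):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy feature matrix (50 observations, 2 features) and target
X = np.arange(100).reshape(50, 2)
y = np.arange(50)

# First split: 80% training, 20% testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Second split: carve a validation set out of the training data
X_tr, X_val, y_tr, y_val = train_test_split(
    X_train, y_train, test_size=0.2, random_state=42
)
print(len(X_tr), len(X_val), len(X_test))
```

Note that the validation fraction is taken from the training portion, so the final proportions differ from the first `test_size` alone.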

We will create a feature matrix X and target vector y. The target is "severe_damage".

[ ]:
target = "severe_damage"
Drop the target from the DataFrame and save the result into `X`. Save the target column into `y`.

[ ]:
X = ...
Finally, we will split our dataset into a training and test set using the `train_test_split` function from `scikit-learn`. 

[ ]:
X_train, X_test, y_train, y_test = train_test_split(
## Validation Set

<font size="+1">Practice: Perform a randomized split using scikit-learn</font>

Try it yourself! Use train_test_split to divide the training data (X_train and y_train) into training and validation sets using the same randomized train-test split function used previously. The validation data will be 20% of the previously constructed training data. Don't forget to set a random_state.

[ ]:
X_train, X_val, y_train, y_val = ...
# Key Concepts

## Majority and Minority Classes

The majority class refers to whatever category in a binary target occurs most frequently, and the minority class refers to whatever category occurs less frequently. Let's use the `value_counts` method to plot the relative frequency of the two classes with a bar chart.

[ ]:
    kind="bar", xlabel="Group", ylabel="Relative Frequency"
Since category 1 (`severe_damage` = `True`) occurs more frequently, it is the majority class.

## Positive and Negative Classes

**Positive class** and **negative class** are the two possible labels for binary classification problems. For example, if we are classifying whether an email is spam or not, we can designate "spam" as the positive class and "not spam" as the negative class. For the example in the project, we have "bankrupt" as the positive class and "not bankrupt" as the negative class. Conventionally, we use `0` or `False` to represent negative class, and `1` or `True` to represent positive class.

# Classification with Logistic Regression

## Logistic Regression

The logistic regression model is the classifier version of linear regression. It will predict probability values that can be used to assign class labels. The model works by taking the output of a linear regression model and feeding it into a sigmoid or logistic function.

Why transform a linear model this way? Linear regression models are great for regression problems because they can give you predictions that range from negative infinity to positive infinity. However, the sigmoid function bounds predictions between 0 and 1, which we then treat as a probability. This allows us to use the model for classification problems.
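A minimal sketch of the sigmoid function itself, which maps any real number into the interval (0, 1):

```python
import numpy as np

def sigmoid(z):
    # 1 / (1 + e^(-z)): large positive z -> near 1, large negative z -> near 0
    return 1 / (1 + np.exp(-z))

print(sigmoid(0))    # the midpoint, 0.5
print(sigmoid(10))   # close to 1
print(sigmoid(-10))  # close to 0
```

Treating this output as a probability, a common rule is to predict class 1 whenever the value exceeds 0.5.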

An example of the sigmoid function is shown below.

[ ]:
x = np.linspace(-10, 10, 100)
The following video summarizes the math behind logistic regression:

[ ]:
YouTubeVideo("yIYKR4sgzI8")
You can add the logistic regression as a named step in a model pipeline like below:

[ ]:
model = make_pipeline(OneHotEncoder(), LogisticRegression(max_iter=1000))
## High-cardinality Features

Cardinality refers to the number of unique values in a categorical variable. High cardinality means the categorical features have a large number of unique values. These features often don't work well with either one hot encoding or ordinal encoding. There is no exact number of unique values that makes a feature high-cardinality, but if the value of the categorical feature is unique for almost all observations, it can usually be dropped. You can see the number of unique values in a variable by using the `value_counts` method. For example, to check the number of unique values in the `roof_type` column:

[ ]:
df["roof_type"].value_counts()
There are only three unique values, so we will leave the column in the DataFrame. 

Practice

Try it yourself! Use value_counts to check the number of unique values in the building_id column. Remove the column in the wrangle function if it has a large number of unique values.

[ ]:
# X_train["building_id"].value_counts()
# Classification with Tree-based Models

## Decision Trees

Decision trees are a general class of machine learning models that are used for both classification and regression. The model resembles a tree, complete with branches and leaves. The model is essentially a series of questions with "yes" or "no" answers. The decision tree starts by checking whatever condition does the best job of correctly separating the data into the two classes in the binary target. It then progressively checks more conditions until it can predict an observation's label. Decision trees are popular because they are more flexible than linear models and intuitive in a way that makes them easy to explain to stakeholders who are not familiar with data science.

Decision trees pros and cons:

| Pros | Cons |
|------|------|
| can be used for classification and regression | generalization: they are prone to overfitting |
| handles both numerical and categorical data | robustness: small variations in data can result in a different tree |
| models nonlinear relationships between the features and target | class imbalance: if one class is much larger than the other, the tree may be unbalanced |
The following video summarizes a decision tree:

[ ]:
YouTubeVideo("7VeUPuFGJHk")
We will fit a decision tree to the training data, using an ordinal encoder to encode the categorical features:

[ ]:
    OrdinalEncoder(), DecisionTreeClassifier(max_depth=6, random_state=42)
## Prediction

## Probability Estimates

Sometimes a model makes the same prediction for the target of two observations, but is more certain about one prediction. This is the difference between the prediction and the prediction's associated probability. 

The `predict` method predicts the target of an unlabeled observation. The `predict_proba` method outputs the probability that an unlabeled observation belongs to each of the two classes in the target. Both methods work similarly: each is run on the fitted model and takes a set of features as its input. For example, if we want to see the predictions for the observations in `X_train`:

[ ]:
model.predict(X_train)
And if we wanted to see the associated probabilities for these predictions:

[ ]:
model.predict_proba(X_train)
Note that there are two probability estimates for each observation, one for the likelihood of each class in the target. The second probability is the likelihood that the unknown observation belongs to the class equal to 1 and the first probability is the likelihood that the unknown observation belongs to the class equal to 0. Whichever class's probability is higher is the predicted class from `predict`. 
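This relationship can be sketched with a hypothetical probability array (not output from the model above):

```python
import numpy as np

# Hypothetical probability estimates, one row per observation:
# column 0 -> P(class 0), column 1 -> P(class 1)
proba = np.array([[0.90, 0.10],
                  [0.30, 0.70],
                  [0.45, 0.55]])

# Each row sums to 1, and the predicted class is the column with
# the higher probability (what `predict` would return)
predictions = proba.argmax(axis=1)
print(predictions)
```

The third row shows why probabilities matter: the model predicts class 1 for both the second and third observations, but it is far less certain about the third.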

<font size="+1">Practice: Generate probability estimates using a trained model in scikit-learn</font>

Try it yourself! Use predict_proba to generate probability estimates for the observations in X_test.

[ ]:
model.predict_proba(X_test)
## Evaluation

### Calculating Accuracy Score

A natural choice for a metric for classification is accuracy. Accuracy is equal to the number of observations you correctly classified over all observations. For example, if your model properly identified 77 out of 100 images, you have an accuracy of 77%. Accuracy is an easy metric to both understand and calculate. Mathematically, it is simply

$$\text{accuracy} = \frac{\text{number of correct observations}}{\text{number of observations}}$$

Model accuracy can be calculated using the accuracy_score function. The function requires two arguments, the true labels and the predicted labels. For example, if we want to calculate the model accuracy score on the training data:
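The definition can be checked by hand on hypothetical labels and compared against `accuracy_score`:

```python
from sklearn.metrics import accuracy_score

# Hypothetical true and predicted labels, for illustration only
y_true = [1, 0, 1, 1, 0, 1, 0, 1, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1, 1, 1, 0, 1]

# Accuracy by hand: correct predictions over all predictions
correct = sum(t == p for t, p in zip(y_true, y_pred))
manual_acc = correct / len(y_true)

print(manual_acc, accuracy_score(y_true, y_pred))
```

Both calculations agree: 8 of the 10 predictions match, for an accuracy of 0.8.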

[ ]:
acc_train = accuracy_score(y_train, model.predict(X_train))
<font size="+1">Practice: Calculate the accuracy score for a model in scikit-learn</font>

Try it yourself! Calculate the model's accuracy on the validation data:

[ ]:
print("Validation Accuracy:", round(acc_val, 2))
### Baseline Accuracy Score

How do you know whether or not the accuracy score you calculated for your model is good? A baseline accuracy score for the model can be used to compare your model accuracy results against. A common baseline is to use the percentage that the majority class shows up in the training data. This would be your accuracy if you simply predicted the majority class for all observations. If the model is not beating this baseline, that suggests that the features are not adding any valuable information to classify your observations.

We can use the value_counts method with the normalize = True argument to calculate the baseline accuracy:

[ ]:
acc_baseline = y_train.value_counts(normalize=True).max()
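A quick sketch of the same calculation on a hypothetical target series:

```python
import pandas as pd

# Hypothetical imbalanced target, for illustration only
y_train_toy = pd.Series([1, 1, 1, 0, 1, 1, 0, 1, 1, 1])

# value_counts(normalize=True) gives relative frequencies; the
# majority-class frequency is the baseline accuracy
acc_baseline = y_train_toy.value_counts(normalize=True).max()
print(acc_baseline)
```

Here a model that always predicts `1` would already be 80% accurate, so a trained model must beat 0.8 to be adding value.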
### Confusion Matrix

Accuracy score may not provide enough information to assess how a model is performing because it only gives us an overall score. Also, imbalanced data can lead to a high accuracy score even when a model isn't particularly useful. If we want to know what fraction of all positive predictions were correct and what fraction of positive observations did we identify, we can use a **confusion matrix**.

A confusion matrix is a table summarizing the performance of the model by enumerating true and false positives and the true and false negatives.

|                         | Positive Observation | Negative Observation |
|-------------------------|----------------------|----------------------|
| **Positive Prediction** | True Positive (TP)   | False Positive (FP)  |
| **Negative Prediction** | False Negative (FN)  | True Negative (TN)   |
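A minimal sketch on hypothetical labels. Note that scikit-learn's `confusion_matrix` orients the table differently from the one above: rows are true labels and columns are predictions, so for labels 0 and 1 the result is `[[TN, FP], [FN, TP]]`:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels, for illustration only
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1]

# Rows are true labels, columns are predictions:
# here TN = 2, FP = 1, FN = 1, TP = 2
cm = confusion_matrix(y_true, y_pred)
print(cm)
```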
Refer to this video for more details on the confusion matrix:

[ ]:
YouTubeVideo("_cpiuMuFj3U")
Here is the code to get the confusion matrix for the training set:

[ ]:
cm = confusion_matrix(y_train, model.predict(X_train))
You can also use a heatmap to better visualize the confusion matrix using `ConfusionMatrixDisplay`:

[ ]:
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=model.classes_)
<font size="+1">Practice</font>

Get the confusion matrix for the validation set and display it with `ConfusionMatrixDisplay`:

[ ]:
disp.plot()
### Precision Score

Depending on the context of the problem, instead of knowing model performance in both classes, sometimes we are more interested in the results for the positive class. That's when we use **precision**. Precision is the fraction of true positives over all positive predictions. It is a measure of how "precise" our model is with regard to labeling observations as positive.

For example, in Project 3 we try to predict whether a company will go bankrupt, with "bankrupt" as the positive class. Out of all positive predictions made by the model, some companies actually went bankrupt (true positives, TP), while others didn't (false positives, FP). Precision measures how many of the model's positive predictions (TP + FP) were correct (TP). The equation for precision is:

$$\text{precision} = \frac{TP}{TP + FP}$$
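A quick check of the formula on hypothetical labels using `precision_score`:

```python
from sklearn.metrics import precision_score

# Hypothetical labels: among the positive predictions, TP = 2 and FP = 1
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1]

# precision = TP / (TP + FP) = 2 / 3
precision = precision_score(y_true, y_pred)
print(precision)
```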

Using the data and model above, we can get a precision score using the code below:

[ ]:
precision = precision_score(y_train, model.predict(X_train))
<font size="+1">Practice</font>

Get precision for the validation set

[ ]:
print(f"Validation Set Precision is {round(precision_val, 2)}")
### Recall Score

What if we care more about catching all of the positive observations? In this case, we need to calculate **recall**. Recall is the fraction of true positives over all positive observations. It is a measure of our model's ability to "catch" and properly label observations that are positive.

Let's return to the Poland bankruptcy example. Of all the companies that actually went bankrupt (TP + FN), how many did our model predict as going bankrupt (TP)? That's what recall measures. The equation to calculate recall is:

$$\text{recall} = \frac{TP}{TP + FN}$$
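A quick check of the formula on hypothetical labels using `recall_score`:

```python
from sklearn.metrics import recall_score

# Hypothetical labels: among the positive observations, TP = 2 and FN = 1
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1]

# recall = TP / (TP + FN) = 2 / 3
recall = recall_score(y_true, y_pred)
print(recall)
```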

Here is the code to calculate recall:

[ ]:
recall = recall_score(y_train, model.predict(X_train))
<font size="+1">Practice</font>

Get recall for the validation set

[ ]:
print(f"Validation Set Recall is {round(recall_val, 2)}")
### Classification Report

We can also use a **classification report** to look at the whole picture of the classification model's performance. A classification report includes precision, recall, the **F1 score**, and **support**. We already know the first two; the F1 score is the harmonic mean of precision and recall. Its equation is:

$$F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$$

Support is the number of observations for each class, so it is useful for understanding whether the data is imbalanced.
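The harmonic-mean relationship can be verified on hypothetical labels by comparing the formula against `f1_score`:

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Hypothetical labels, for illustration only
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1]

p = precision_score(y_true, y_pred)
r = recall_score(y_true, y_pred)

# F1 is the harmonic mean of precision and recall
f1_manual = 2 * (p * r) / (p + r)
print(f1_manual, f1_score(y_true, y_pred))
```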

[ ]:
print(classification_report(y_train, model.predict(X_train)))
Note that the last two rows give the macro and the weighted average. The macro average is the arithmetic average of a metric across the two classes:

$$0.65 = \frac{0.70 + 0.60}{2}$$

The weighted average is calculated as:

$$\frac{\sum(\text{metric of interest} \cdot \text{weight})}{\sum(\text{weights})}$$

Here the weights are the number of observations for each class.
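A sketch of both averages, using the per-class values from the example above (0.70 and 0.60) and assumed class sizes of 40 and 60:

```python
# Hypothetical per-class metric values and class sizes (supports)
metric_class0, metric_class1 = 0.70, 0.60
support_class0, support_class1 = 40, 60

# Macro average: plain arithmetic mean of the two classes
macro = (metric_class0 + metric_class1) / 2

# Weighted average: each class weighted by its number of observations
weighted = (metric_class0 * support_class0 + metric_class1 * support_class1) / (
    support_class0 + support_class1
)
print(macro, weighted)
```

Because class 1 has more observations here, the weighted average sits closer to class 1's value than the macro average does.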

Here you may notice there are two rows of metrics. If you refer back to what we calculated previously for precision and recall, the second row aligns with what we found. That's because we usually define class 1 as the **positive class**, so we refer to class 1's metrics as the true precision and recall values.

<font size="+1">Practice</font>

Get classification report for the validation set

[ ]:
print(...)
## Communication

### Plotting a Decision Tree

The `plot_tree` function can be used to plot a decision tree. The visualization is fit to the size of the axis set with `matplotlib`. Use the `figsize` argument of `plt.subplots` to control the size of the tree.

We'll demonstrate how to use the `plot_tree` function to graphically display a decision tree:

[ ]:
    proportion=True,  # Display proportion of classes in leaf
<font size="+1">Practice: Plot a decision tree using `scikit-learn`</font>

Try it yourself! Use plot_tree to plot the decision tree from the model object, modifying the parameters of the tree to only display the first 3 levels and to not display the proportion of classes in a leaf.

[ ]:
fig, ax = plt.subplots(figsize=(25, 12))
The feature names and importance of features can be extracted from the column names in your training set. For the `importances`, you can access the [`feature_importances_`](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier.feature_importances_) attribute of your model's `DecisionTreeClassifier`. 


[ ]:
importances = model.named_steps["decisiontreeclassifier"].feature_importances_
The importance of a feature is based on how well the feature correctly classifies observations. In a decision tree, this is on average how much a feature reduces the impurity metric. The tree determines how to split based on an impurity function. The impurity function calculates how homogeneous observations are at a particular leaf node. Conditions that do a better job minimizing impurity are used to split first. The `sklearn.tree` algorithm uses the Gini impurity by default. The Gini impurity measures the probability of an incorrect classification in the model for each branch. It ranges from 0 to 1. 

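The Gini impurity described above is simple enough to compute by hand. A small sketch (the class counts are made up):

```python
# Gini impurity of a node: 1 minus the sum of squared class proportions
def gini(class_counts):
    total = sum(class_counts)
    return 1 - sum((n / total) ** 2 for n in class_counts)

print(gini([50, 50]))   # maximally mixed two-class node -> 0.5
print(gini([100, 0]))   # perfectly pure node -> 0.0
```

Splits that push nodes toward the pure end of this range are preferred.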
Let's create a bar chart to plot each feature with its corresponding importance. To build this bar chart, we'll create a pandas Series named `feat_imp`, where the index is the features and the values are the importances. The Series should be sorted from smallest to largest importance so that the bar chart is also in order.


[ ]:
feat_imp = pd.Series(importances, index=features).sort_values()
Next, we'll use the series to build a bar chart:


[ ]:
plt.xlabel("Gini Importance");
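Putting the pieces above together, here's a self-contained sketch of the bar chart, with made-up feature names and importances standing in for the trained model's values:

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical names and values; the lesson extracts these from the model
importances = [0.10, 0.55, 0.35]
features = ["rooms", "surface", "neighborhood"]

# Sort so the bar chart goes from smallest to largest importance
feat_imp = pd.Series(importances, index=features).sort_values()

feat_imp.plot(kind="barh")
plt.xlabel("Gini Importance")
plt.ylabel("Feature");
```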
# Classification with Ensemble Models

Ensemble models are machine learning models that use more than one predictor to arrive at a prediction. A group of predictors forms an _ensemble_. In general, ensemble models perform better than a single predictor. There are three types of ensemble models: **bagging**, **boosting**, and **blending**. Of the three, decision trees are commonly used to construct bagging and boosting models.

## Random Forest

The performance of a single decision tree will be limited. Instead of relying on one tree, a better approach is to aggregate the predictions of multiple trees. On average, aggregation will perform better than a single predictor. You can envision the aggregation as mimicking the idea of "wisdom of the crowd." We call a tree-based model that aggregates the predictions of multiple trees a **random forest**.


In order for a random forest to be effective, the model needs a diverse collection of trees. There should be variations in the chosen thresholds for splitting and the number of nodes and branches. There is no point in aggregating the predicted results if all the trees are nearly identical and produce the same result. There is no "wisdom of the crowd" if everyone thinks alike. To achieve a diverse set of trees, we need to:

  1. Train each tree in the forest using a different subset of the training set.
  2. Only consider a subset of features when deciding how to split the nodes.

On the first point, we would ideally generate a new training set for each tree. However, it's often too difficult or expensive to collect more data, so we have to make do with what we have. Bootstrapping is a general statistical technique for generating "new" data sets from a single set by random sampling with replacement. Sampling with replacement allows a data point to be sampled more than once.

Typically, when training a standard decision tree model, the algorithm considers all features when deciding each node split. Considering only a subset of the features ensures that your trees do not resemble each other; if the algorithm considered all features, a dominant feature would be chosen repeatedly for node splits.

The hyperparameters available for random forests include those of decision trees, with some additions.

| Hyperparameter | Description |
|----------------|-------------|
| `n_estimators` | The number of trees in the forest |
| `max_samples` | If `bootstrap` is `True`, the number of samples to draw from `X` to train each base estimator |
| `max_features` | The number of features to consider when looking for the best split |
| `n_jobs` | The number of jobs to run in parallel when fitting and predicting |
| `warm_start` | If set to `True`, reuse the trees from a prior fitting and just train the additional trees |

Since the random forest is based on the idea of bootstrapping and aggregating the results, it is referred to as a **bagging** ensemble model.
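As a hedged sketch of how the hyperparameters in the table come together, here is a small random forest fit on synthetic data (the dataset and every value below are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Toy data standing in for a real training set
X, y = make_classification(n_samples=200, n_features=8, random_state=42)

forest = RandomForestClassifier(
    n_estimators=100,     # number of trees in the forest
    max_features="sqrt",  # consider a subset of features at each split
    n_jobs=-1,            # fit trees in parallel
    random_state=42,
)
forest.fit(X, y)
print(len(forest.estimators_))  # the fitted forest holds 100 trees
```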

## Gradient Boosting Trees

Gradient boosting trees is another ensemble model. It uses a collection of tree models arranged in a sequence. Here, the model is built stage-wise; each additional tree aims to correct the previous trees' errors.


Where does the _gradient_ in gradient boosting trees come from? Gradient descent is a minimization algorithm that updates the current answer by taking a step in the direction that minimizes the loss function. The gradient boosting trees algorithm works the same way: it adds trees to minimize the loss and improve model performance. The term _boosting_ refers to the algorithm's ability to combine multiple weak models in sequence to form a stronger model.

Gradient boosting trees have a similar set of hyperparameters as random forests but with some key additions.

| Hyperparameter | Description |
|----------------|-------------|
| `learning_rate` | Multiplicative factor of each tree's contribution to the model |
| `subsample` | Fraction of the training data to use when fitting each tree |
The learning rate determines how much each tree affects the final outcome and is very important for model convergence, so it should be considered during hyperparameter tuning to improve model performance.
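A comparable sketch for gradient boosting, again on synthetic data with illustrative hyperparameter values:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=200, random_state=0)

gb = GradientBoostingClassifier(
    n_estimators=50,    # trees are added one after another
    learning_rate=0.1,  # multiplicative factor of each tree's contribution
    subsample=0.8,      # fraction of the training rows used per tree
    random_state=0,
)
gb.fit(X, y)
print(gb.score(X, y))  # training accuracy
```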

# Hyperparameter Tuning

When we defined our decision tree estimator, we chose how many layers the tree would have using the `max_depth` argument. More generally, when we instantiate any estimator, we can pass keyword arguments that will dictate its structure. The decision tree regressor accepts [12 different keyword arguments](http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html#sklearn.tree.DecisionTreeRegressor). These arguments are called **hyperparameters**. This is in contrast to **parameters**, which are the numbers that our model uses to predict labels based on features. Hyperparameters are decided before training and dictate the model's structure. Parameters are optimized during training. Basically all models have hyperparameters. Even a simple linear regressor has a hyperparameter: `fit_intercept`.


Since changing a hyperparameter will change the structure of the model, we should think of choosing hyperparameters as part of the model-building process. We usually use cross-validation combined with grid search when looking for the best model with the right hyperparameters. This process is called **hyperparameter tuning**.

## Cross-Validation

When trying out different hyperparameter settings for estimators (such as the `max_depth` for a random forest), there's a risk in using the test set to evaluate these settings. What happens is that your knowledge about the test set can “leak” into the model, and performance metrics no longer reflect the model's ability to generalize. 


The generalization problem can be solved by adding an extra set called the **validation set**. In this case, we train the model on the training set, then evaluate different hyperparameters using the validation set. If the model performs well on both sets, we finally evaluate it on the test set.

But there's a drawback to this strategy. Dividing the data into three sets reduces the number of samples available to fit the model. In addition, the results will change depending on the particular choice of training and validation portions.

The solution here is to use cross-validation (CV for short). In this case, we still use a test set, but a validation set is no longer needed. [k-fold CV](http://scikit-learn.org/stable/modules/cross_validation.html#k-fold) is the most widely used cross-validation method. The algorithm divides the training set into k smaller folds. For each fold, we:

  1. Train the model using all the folds but one (i.e., the other k−1 folds) as training data;

  2. Validate the model using the remaining fold as if it were test data, and store the performance metric.

This approach makes the best use of all the data we are given, so it's particularly useful when the sample size is small.

Here is the code for conducting a 5-fold cross-validation and reporting the accuracy score for each fold.


[ ]:
clf = make_pipeline(OneHotEncoder(), DecisionTreeClassifier())
[ ]:
    f"{round(scores.mean(),2)} accuracy with a standard deviation of {round(scores.std(),2)}"
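Since the cells above show only fragments, here is a self-contained version of the same 5-fold CV on a toy dataset (the `OneHotEncoder` step is dropped because these synthetic features are already numeric):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=1)
clf = DecisionTreeClassifier(random_state=1)

scores = cross_val_score(clf, X, y, cv=5)  # one accuracy score per fold
print(scores)
print(
    f"{round(scores.mean(), 2)} accuracy with a standard deviation of {round(scores.std(), 2)}"
)
```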
<font size="+1">Practice</font>


Perform 2-fold cross-validation.

[ ]:
scores = ...
## Grid Search

Another useful tool for comparing different hyperparameter values is `GridSearchCV`. There are two ideas behind `GridSearchCV`: first, we split the data using k-fold cross-validation; then we train and evaluate models with different hyperparameter settings selected from a grid of possible combinations.

First, we need to define which hyperparameters we want to tune and the range over which to tune them. Here we search for the best value of `max_depth` in a decision tree model. Since we'll be building a pipeline that includes a transformer and an estimator, we need to specify that `max_depth` comes from the estimator `decisiontreeclassifier`.


[ ]:
params = {"decisiontreeclassifier__max_depth": range(1, 15)}
Then we define the pipeline and the model with `GridSearchCV`.


[ ]:
clf = make_pipeline(OneHotEncoder(), DecisionTreeClassifier())
Lastly, fit the model:


[ ]:
model.fit(X_train, y_train)
We can check the best parameters once the fitting process has finished:


[ ]:
model.best_params_
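Assembling those steps end-to-end, here is a hedged sketch on toy data (with no `OneHotEncoder`, since the synthetic features are already numeric):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=1)

# make_pipeline names this step "decisiontreeclassifier" automatically
pipe = make_pipeline(DecisionTreeClassifier(random_state=1))
params = {"decisiontreeclassifier__max_depth": range(1, 15)}

model = GridSearchCV(pipe, param_grid=params, cv=5)
model.fit(X, y)
print(model.best_params_)
```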
<font size="+1">Practice</font>


Perform `GridSearchCV` on both `max_depth` and `criterion` for the validation set.

[ ]:
# Check the best hyperparameters
# References & Further Reading


  • scikit-learn documentation on decision tree classifier model object
  • scikit-learn documentation on decision tree plot
  • scikit-learn documentation on decision tree math
  • scikit-learn confusion matrix
  • scikit-learn precision score
  • scikit-learn recall score
  • scikit-learn classification report
  • YouTube: Confusion Matrix
  • scikit-learn: Cross Validation
  • scikit-learn: k-fold Cross Validation
---

Copyright 2022 WorldQuant University. This content is licensed solely for personal use. Redistribution or publication of this material is strictly prohibited.


Usage Guidelines

This lesson is part of the DS Lab core curriculum. For that reason, this notebook can only be used on your WQU virtual machine.

This means:

  • ⓧ No downloading this notebook.
  • ⓧ No re-sharing of this notebook with friends or colleagues.
  • ⓧ No downloading the embedded videos in this notebook.
  • ⓧ No re-sharing embedded videos with friends or colleagues.
  • ⓧ No adding this notebook to public or private repositories.
  • ⓧ No uploading this notebook (or screenshots of it) to other websites, including websites for study resources.

<font size="+3"><strong>Time Series: Core Concepts</strong></font>


[1]:
 
from IPython.display import YouTubeVideo
# Model Types


## Autoregression Models


Autoregression (AR) is a time series model that uses observations from previous time steps as input to a regression equation to predict the value at the next time step. AR works in a similar way to autocorrelation: in both cases, we're taking data from one part of a series and comparing it to another part. An AR model regresses a series on its own past values.
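To make "regressing on its own past values" concrete, here is a minimal hand-rolled AR(1) fit with numpy least squares on a simulated series (this is a sketch, not a library AR model, and the coefficient 0.8 is made up):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate y_t = 0.8 * y_{t-1} + noise
y = np.zeros(500)
for t in range(1, 500):
    y[t] = 0.8 * y[t - 1] + rng.normal()

# Regress the series on its own lagged values (plus an intercept)
X = np.column_stack([np.ones(499), y[:-1]])
coef, *_ = np.linalg.lstsq(X, y[1:], rcond=None)
print(coef)  # fitted [intercept, slope]; the slope should land near 0.8
```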

## ARMA Models


ARMA stands for Autoregressive Moving Average, and it's a special kind of time-series analysis. So far, we've used autoregression (AR) to build our time-series models, and you might recall that AR models rely on values that remain relatively stable over time. That is, they can predict the future very well, as long as the future looks roughly the same as the past. The trouble with predicting the future is that things can suddenly change, and as a result, the future doesn't look much like the past anymore. These sudden changes — economists call them exogenous shocks — can be as big as a hurricane destroying a city or an unexpected increase in the minimum wage, and they can be as small as a new restaurant opening in a neighborhood or a single person losing their job. In our data, the air quality might change if there were a nearby forest fire, or if a building collapsed near one of the sensors and raised a giant cloud of dust.

Below is a video from Ritvik Kharkar that explains MA models in terms of cupcakes and crazy professors — two things we love!

[2]:
 
YouTubeVideo("voryLhxiPzE")
[2]:
And in this video, Ritvik talks about the ARMA model we use in Project 3. 


[3]:
 
YouTubeVideo("HhvTlaN06AM")
[3]:
# Plots


## ACF Plot


When we've worked with autocorrelations in the past, we've treated them like static relationships, but that's not always how they work. Sometimes, we want to actually see how those autocorrelations change as the lag increases, which means we need to think of them as *functions* of the lag. When we create a visual representation of an autocorrelation function (ACF), we're making an **ACF plot**.

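As a sketch of the numbers an ACF plot displays, here are autocorrelations at the first few lags, computed with plain numpy on a synthetic random walk (a series with strong, slowly decaying autocorrelation):

```python
import numpy as np

rng = np.random.default_rng(1)
y = np.cumsum(rng.normal(size=300))  # a random walk

def acf(series, lag):
    """Sample autocorrelation of the series at the given positive lag."""
    s = series - series.mean()
    return np.sum(s[lag:] * s[:-lag]) / np.sum(s * s)

print([round(acf(y, k), 2) for k in range(1, 6)])  # decays slowly
```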

## PACF Plot


Autocorrelations mix together two kinds of influence. The direct influence is the relationship between an observation and the one a chosen number of time steps before it. The indirect influence flows through the observations in between: the reading at 1:00 affects the reading at 2:00, which in turn affects the reading at 3:00, and so on. Those indirect effects can mask what's really going on, so it's a good idea to strip them out and see what the relationship looks like when it only reflects the direct influence.

An autocorrelation that keeps only the direct influence is called a partial autocorrelation, and when we view that partial autocorrelation as a function, we call it a PACF.

PACF plots represent those things visually. We want to compare our ACF and PACF plots to see which model best describes our time series: if the ACF tails off slowly while the PACF cuts off after a few lags, an AR model is the better description; if the PACF tails off slowly while the ACF cuts off, an MA model is the better description.

# Statistical Concepts


## Walk-Forward Validation


Our predictions lose power over time because the model gets farther and farther away from its beginning. But what if we could move that beginning forward with the model? That's what walk-forward validation is. In walk-forward validation, we re-train the model for each new observation in the dataset, dropping the data that's farthest in the past. Let's say that our prediction for what's going to happen at 12:00 is based on what happened at 11:00, 10:00, and 9:00. When we move forward an hour to predict what's going to happen at 1:00, we only use data from 10:00, 11:00, and 12:00, dropping the data from 9:00 because it's now too far in the past.
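The loop below sketches walk-forward validation with a naive "predict the window mean" model standing in for a real forecaster (the window size and data are made up):

```python
import numpy as np

rng = np.random.default_rng(2)
series = rng.normal(loc=10, scale=1, size=50)  # toy hourly readings

window = 3
predictions = []
for t in range(window, len(series)):
    history = series[t - window:t]      # keep only the most recent observations
    predictions.append(history.mean())  # re-"train" at every step forward

errors = np.abs(series[window:] - np.array(predictions))
print(round(errors.mean(), 2))  # average absolute forecast error
```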

## Parameters


Parameters are the parts of the model that are learned from the training data.

## Hyperparameters


We've already seen that parameters are elements that a machine learning model learns from the training data. Hyperparameters, on the other hand, are elements of the model that come from somewhere else. Data scientists choose hyperparameters either by examining the data themselves, or by creating some kind of automated testing of different options to see how they perform. Hyperparameters are set before the model is trained, which means that they significantly impact how the model is trained, and how it subsequently performs. One way of thinking about the difference between the two is that parameters come from inside the model, and hyperparameters come from outside the model.

When we think about the hyperparameters of these time-series models, we think in terms of `p` and `q` values. The `p` value represents the number of lagged observations included in the model, and the `q` value is the size of the moving average window. These values count as hyperparameters because we get to decide what they are. How many lagged observations do we want to include? How big should our window of moving averages be?

## Rolling Averages


A rolling average is the mean value of successive subsets of numbers in a dataset. For example, I might have data on the daily income for a shop I own, and as long as the shop stays open, I can calculate a rolling average. On Friday, I might calculate the average income from Monday-Thursday. The next Monday, I might calculate the average income from Tuesday-Friday, and the next day, I might calculate the average income from Wednesday to Monday, and so on. These averages roll forward, giving me a sense of how the data is changing relative to some fixed frame of reference. In this case, and in many data science applications, that frame is time. Calculating rolling averages is helpful for making accurate forecasts about the ways data will change in the future.
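The shop-income example translates directly to a pandas rolling mean (the numbers below are made up):

```python
import pandas as pd

income = pd.Series(
    [100, 120, 90, 110, 130, 95, 105],
    index=pd.date_range("2022-01-03", periods=7),
)

# Each value averages that day and the three days before it;
# the first three entries are NaN because the window isn't full yet.
rolling = income.rolling(window=4).mean()
print(rolling)
```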


<font size="+3"><strong>Machine Learning: Linear Regression</strong></font>


# Linear Regression


[1]:
from IPython.display import YouTubeVideo
In machine learning, a **regression** problem is when you need to build a model that's going to predict a continuous, numerical value, like the sale price of an apartment. One of the models that you can use for regression problems is called **linear regression**. In its simplest form, we fit a model that will predict a single output variable (called a **target vector**) as a linear function of a single input variable (called a **feature matrix**).


Speaking mathematically, if we have input data points $x$ and corresponding measured outputs $y$, then we find parameters $m$ and $b$ such that $y \approx mx + b$ for our measured data points. We then use the fitted values of $m$ and $b$ to predict values of $y$ for new values of $x$.
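A quick sketch of recovering $m$ and $b$ with scikit-learn on synthetic data (the true values $m = 3$ and $b = 5$ below, and the name `demo_lr`, are made up for the demo):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(3)
x = rng.uniform(0, 10, size=100).reshape(-1, 1)              # feature matrix
y = 3.0 * x.ravel() + 5.0 + rng.normal(scale=0.5, size=100)  # noisy target

demo_lr = LinearRegression().fit(x, y)
print(demo_lr.coef_[0], demo_lr.intercept_)  # should land near m = 3 and b = 5
```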

## Fitting a Model to Training Data


You'll work on two cases: a model on the raw data set and a model on transformed data. First try to use linear regression to predict `price_aprox_usd` as a multiple of `surface_covered_in_m2` plus a constant for the `mexico-city-real-estate-1.csv` dataset.


[2]:
mexico_city1 = pd.read_csv("./data/mexico-city-real-estate-1.csv", usecols=columns)
[2]:
LinearRegression()
<font size="+1">Practice</font> 


Fit a linear regression model to the `mexico-city-real-estate-2.csv` data set to relate `price_aprox_usd` and `surface_covered_in_m2`.

[3]:
mexico_city2 = pd.read_csv("./data/mexico-city-real-estate-2.csv", usecols=columns)
[3]:
LinearRegression()
## Generating Predictions Using a Trained Model


After fitting the model, we want to use it to make predictions. In most applications, you'll want to predict an unknown quantity from data that's different from the data you fitted your model on. To test the accuracy of your fitted model, you'll typically use a different set of data with an outcome you already know. Here, we'll use the datasets from `mexico-city-test-features.csv` and `mexico-city-test-labels.csv`. It's also helpful to plot the data and the predictions to see if there are any patterns that suggest fitting a different model.


[4]:
    "./data/mexico-city-test-features.csv", usecols=["surface_covered_in_m2"]
[4]:
array([309549.84644749, 309411.49120101, 310207.03386826, 309428.78560682,
       309515.25763587])
<font size="+1">Practice</font> 


Read the data from `mexico-city-real-estate-4.csv` into a DataFrame and then generate a list of price predictions for the properties using your model `lr`.

[5]:
mexico_city4 = pd.read_csv("./data/mexico-city-test-features.csv", usecols=["surface_covered_in_m2"])
[5]:
array([309549.84644749, 309411.49120101, 310207.03386826, 309428.78560682,
       309515.25763587])
## Ridge Regression


Sometimes, the values for the coefficients and the intercept (both positive and negative) are very large. When you see this in a linear model, especially a high-dimensional one, what's happening is that the model is overfitting to the training data and then can't generalize to the test data. Some people call this the curse of dimensionality. ☠️

The way to solve this problem is to use **regularization**, a group of techniques that prevent overfitting. In this case, we'll change the predictor from `LinearRegression` to `Ridge`, which is a linear regressor with an added tool for keeping model coefficients from getting too big.
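A minimal sketch of the swap, using two nearly identical synthetic columns to provoke large coefficients (the column construction and the `alpha` value are illustrative):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(4)
x1 = rng.normal(size=(50, 1))
X = np.hstack([x1, x1 + rng.normal(scale=0.01, size=(50, 1))])  # near-duplicate columns
y = X[:, 0] + rng.normal(scale=0.1, size=50)

plain = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)  # alpha controls the coefficient penalty

print(plain.coef_)  # can blow up in opposite directions
print(ridge.coef_)  # kept small by the penalty
```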

Here's a good explanation of what a ridge regression is and why it's important:

[6]:
YouTubeVideo("Q81RR3yKn30")
[6]:
## Generalization


Notice that we tested the model with a dataset that's *different* from the one we used to train the model. Machine learning models are useful if they allow you to make predictions about data other than what you used to train your model. We call this concept **generalization**.  By testing your model with different data than you used to train it, you're checking to see if your model can generalize.  Most machine learning models do not generalize to all possible types of input data, so they should be used with care. On the other hand, machine learning models that don't generalize to make predictions for at least a restricted set of data aren't very useful.


## Calculating the Mean Absolute Error for a List of Predictions


Plots are great for displaying information, but a value that tells you the typical error in a prediction is helpful too. This value is called the **mean absolute error**, and it's defined as the average value of the magnitude of the error in the predictions. The closer the MAE is to `0`, the better our model fits the data. scikit-learn will do this for you if you pass it the price predictions from your regression model and the actual prices from the test data set. Let's see how our `lr` model did by comparing its predictions to the true values in `mexico_city_labels`.


[7]:
mean_absolute_error(price_pred_example, mexico_city_labels)
[7]:
226209.01327442465
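Since the MAE is just the mean of the absolute errors, scikit-learn's helper and a hand computation agree (the prices below are toy numbers):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

y_true = np.array([100.0, 200.0, 300.0])
y_pred = np.array([110.0, 190.0, 330.0])

print(mean_absolute_error(y_true, y_pred))  # about 16.67
print(np.abs(y_true - y_pred).mean())       # same value, by definition
```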
## Access an Attribute of a Trained Model


After training a model that fits a straight line to your data, you can obtain the parameters that describe your line. We're particularly interested in the slope `lr.coef_` and the axis intercept `lr.intercept_`.

[8]:
print(lr.coef_)
[3.45888116]
[9]:
print(lr.intercept_)
309238.5471429155
## Multicollinearity

When you're creating a linear model that uses many features to make predictions, some of those features can be highly correlated with each other. This isn't a problem that's going to break your model; it will still make predictions and it might have good performance metrics. But it is an issue if you want to interpret the coefficients for your model because it becomes hard to tell which features are truly important. 

Let's look at `mexico-city-real-estate-1.csv` for an example. First, we'll import the data.

[10]:
mexico_city1 = pd.read_csv("./data/mexico-city-real-estate-1.csv", usecols=columns)
[10]:
price price_aprox_local_currency price_aprox_usd surface_total_in_m2 surface_covered_in_m2 price_per_m2
2 2700000.0 2748947.10 146154.51 61.0 61.0 44262.295082
3 6347000.0 6462061.92 343571.36 176.0 128.0 49585.937500
4 6870000.0 6994543.16 371882.03 180.0 136.0 50514.705882
5 6500000.0 6617835.61 351853.45 300.0 300.0 21666.666667
6 670000.0 682146.11 36267.97 65.0 65.0 10307.692308
Now let's find the correlations between the columns.

[11]:
mexico_city1.corr()
[11]:
price price_aprox_local_currency price_aprox_usd surface_total_in_m2 surface_covered_in_m2 price_per_m2
price 1.000000 0.333655 0.333655 0.112588 0.371073 0.380879
price_aprox_local_currency 0.333655 1.000000 1.000000 0.118123 0.598506 -0.068775
price_aprox_usd 0.333655 1.000000 1.000000 0.118123 0.598506 -0.068775
surface_total_in_m2 0.112588 0.118123 0.118123 1.000000 0.125032 0.003488
surface_covered_in_m2 0.371073 0.598506 0.598506 0.125032 1.000000 -0.147158
price_per_m2 0.380879 -0.068775 -0.068775 0.003488 -0.147158 1.000000
Let's see what happens when we fit a linear regression model for `surface_covered_in_m2` as a function of `price_aprox_usd` and `price_aprox_local_currency`.

[12]:
lr = LinearRegression()
lr.fit(
    mexico_city1[["price_aprox_usd", "price_aprox_local_currency"]],
    mexico_city1["surface_covered_in_m2"],
)
[12]:
LinearRegression()
Let's take a look at the coefficients of the model:

[13]:
print(lr.coef_)
[6593.61513754 -350.56569568]
Ask yourself: Does it make sense that increasing the price of a property by one US dollar would translate to a 6593 m<sup>2</sup> increase in size? Perhaps, though it seems unlikely. And does it make sense that increasing the price by one Mexican peso would translate to a 350 m<sup>2</sup> *decrease* in size? Definitely not. So while this model may perform well when we evaluate it using metrics like mean absolute error, we can't use it to determine which features actually affect our target.

# References & Further Reading

  • A primer on linear regression
  • More on resampling from the pandas documentation
  • More information on rolling averages
  • More on absolute and mean absolute errors
  • A discussion of the various uses of model fitting in machine learning
  • Wikipedia Page on Multicollinearity
  • Online Article on Multicollinearity
  • Wikipedia Article on Generalization
  • Online Tutorial on Regression with scikit-learn
  • Official scikit-learn Documentation on Linear Models
  • Wikipedia Article on Logarithm Function
---

Copyright 2022 WorldQuant University. This content is licensed solely for personal use. Redistribution or publication of this material is strictly prohibited.


Usage Guidelines

This lesson is part of the DS Lab core curriculum. For that reason, this notebook can only be used on your WQU virtual machine.

This means:

  • ⓧ No downloading this notebook.
  • ⓧ No re-sharing of this notebook with friends or colleagues.
  • ⓧ No downloading the embedded videos in this notebook.
  • ⓧ No re-sharing embedded videos with friends or colleagues.
  • ⓧ No adding this notebook to public or private repositories.
  • ⓧ No uploading this notebook (or screenshots of it) to other websites, including websites for study resources.

<font size="+3"><strong>Machine Learning: Core Concepts</strong></font>

# Model Types

**Linear regression** is a way to predict the value of a target variable by fitting a line that best describes the relationship between **big X** and **little y** for the values we already have. If you remember `y = mx + b` from Algebra, it's the same thing: the `b` is the intercept, and the `m` is the beta coefficient. The beta coefficient tells us what change we can expect to see in `y` for every one-unit increase in `X`. If that doesn't seem familiar to you, don't worry about it; we'll give you everything you need to know.

# Statistical Concepts 

## Cost Functions

When we train a model, we're solving an optimization problem. We provide training data to an algorithm and tell it to find the model or model parameters that best fit the data. But how can the algorithm judge what the "best" fit is? What criteria should it use?

A cost function (sometimes also called a loss or error function) is a mathematical formula that provides the score by which the algorithm determines the best fit. Generally, the goal is to minimize the cost function and get the lowest score. For linear models, these functions measure distance, and the model tries to get the closest fit to the data. For tree-based models, they measure impurity, and the model tries to get the purest terminal nodes.
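
To make this concrete, here's a minimal sketch of a distance-based cost function for a line. The data points and the candidate slopes and intercepts are made up for illustration; they're not from this lesson.

```python
import numpy as np

# Illustrative data: y grows roughly linearly with x
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 7.8])

def mse_cost(slope, intercept):
    """Mean squared error: a distance-based cost for a candidate line."""
    predictions = slope * x + intercept
    return np.mean((y - predictions) ** 2)

# The training algorithm's job is to find the slope and intercept
# with the lowest score
print(mse_cost(2.0, 0.0))  # a line close to the data: low cost
print(mse_cost(0.0, 5.0))  # a flat line far from the data: high cost
```

Training a linear model amounts to searching for the parameter values that minimize this score.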

## Residuals

When we perform any type of regression analysis, we end up with a line of best fit. Because our data comes from the real world, it tends to be a little bit messy, so the data points usually don't fall exactly on this line. Most of the time, they're scattered around it, and a residual is the vertical distance between each individual data point and the regression line. Each data point has only one residual, which can be positive if it's above the regression line, negative if it's below the regression line, or zero if the line passes directly through the point. Think of it like this: the model describes a theoretical line. That line doesn't really exist outside the model. The residuals, however, are true values; they represent the actual data that came from real observations.
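
As a quick sketch, with made-up observations and an assumed fitted line `y = 2x + 1`, residuals are just the vertical gaps between the observed points and the line:

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0])
y_actual = np.array([1.3, 2.8, 5.4, 6.9])  # real, messy observations
y_line = 2 * x + 1                         # the model's theoretical line

# Residual = observed value minus the value the line predicts
residuals = y_actual - y_line
print(residuals)  # positive above the line, negative below it
```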

## Performance Metrics

In statistics, an error is the difference between a measurement and reality. There may not be any difference at all, but there's usually something not quite right, and we need to account for that in our model. To do that, we need to figure out the mean absolute error (MAE). Absolute error is the error in a single measurement, and mean absolute error is the average error over the course of several measurements.

Imagine that you're buying some bananas. The store charges for fruit based on weight, so you put your bananas on a scale before you head off to pay for them. The scale says they weigh 1.2 kilos, but your innate sense of weight tells you that they actually weigh 0.9 kilos. The absolute error in that measurement would be 0.3 kilos. It can go the other way too: maybe you know the bananas weigh 1.2 kilos, but the scale says they weigh 0.9 kilos. In that case, the absolute error would still be 0.3 kilos, because even though the numerical difference is -0.3, absolute values are always positive; all you have to do is disregard the - sign.

Let's keep going: you're sure the bananas don't weigh 1.2 kilos, so you weigh them again. This time, the scale says 1.0 kilos. That's still wrong, so you weigh the bananas a third time, and now the scale says 2.3 kilos. Since the actual weight of your bananas hasn't changed, you now have a set of three absolute errors: 0.3, 0.1, and 1.4. If we average those errors together, we get 0.6, which is the mean absolute error for your banana data.
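
The banana arithmetic above can be checked directly, using the same numbers as in the story:

```python
import numpy as np

true_weight = 0.9                 # what the bananas actually weigh, in kilos
scale_readings = [1.2, 1.0, 2.3]  # the three measurements from the scale

# Absolute error for each measurement: magnitude only, the sign is dropped
absolute_errors = [abs(reading - true_weight) for reading in scale_readings]
mae = np.mean(absolute_errors)
print(mae)  # 0.6, up to floating-point rounding
```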

# Data Concepts

## Leakage

Leakage is the use of data in training your model that would not typically be available when making predictions. For example, suppose we want to predict property prices in USD but include property prices in Mexican pesos in our model. If we assume a fixed or nearly constant exchange rate, then our model will have a low error on the training data, but this will not be reflective of its performance on real-world data.
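
Here's a small synthetic demonstration of that exact scenario (the prices, exchange rate, and noise level are all invented): a model given the peso price "predicts" the USD price almost perfectly on the training data, while an honest model that only sees the surface area shows a realistic error.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(0)

surface = rng.uniform(40, 300, size=(200, 1))                # m^2
price_usd = 1500 * surface[:, 0] + rng.normal(0, 30000, 200)
price_mxn = price_usd * 19.5                                 # fixed exchange rate

X_leaky = np.column_stack([surface[:, 0], price_mxn])        # includes the leak
leaky = LinearRegression().fit(X_leaky, price_usd)
honest = LinearRegression().fit(surface, price_usd)

leaky_mae = mean_absolute_error(price_usd, leaky.predict(X_leaky))
honest_mae = mean_absolute_error(price_usd, honest.predict(surface))
print(leaky_mae, honest_mae)  # the leaky model looks deceptively near-perfect
```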

## Imputation

Datasets are often incomplete or missing entries. If the dataset is large and the missing entries are few, then the missing entries aren't all that important. But sometimes, it might be useful to include data with missing entries by finding a way to impute the missing entries in a row or column of a DataFrame. For example, you might use extrapolation when the data points have a pattern, or you might approximate the missing values by mean values.
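
For example, mean imputation with pandas might look like the following sketch; the column name and values are invented for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"surface_m2": [61.0, np.nan, 180.0, np.nan, 65.0]})

# Replace each missing entry with the mean of the observed values
mean_value = df["surface_m2"].mean()  # NaNs are skipped when computing the mean
df["surface_m2"] = df["surface_m2"].fillna(mean_value)
print(df["surface_m2"].tolist())  # [61.0, 102.0, 180.0, 102.0, 65.0]
```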

## Generalization

When we evaluate a model, we test it with a dataset that's different from the one we used to train it. Machine learning models are useful if they allow you to make predictions about data other than what you used to train your model. We call this concept generalization. By testing your model with different data than you used to train it, you're checking to see if your model can generalize. Most machine learning models do not generalize to all possible types of input data, so they should be used with care. On the other hand, machine learning models that don't generalize to make predictions for at least a restricted set of data aren't very useful.
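
In code, this is why we hold out a test set before training. Everything below is synthetic, stand-in data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Synthetic stand-ins for a feature (e.g. surface area) and a target (price)
X = rng.uniform(50, 300, size=(200, 1))
y = 3000 * X[:, 0] + rng.normal(0, 20000, 200)

# Hold out data the model never sees during training
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)

# A low error on the held-out data is evidence the model generalizes
print(mean_absolute_error(y_test, model.predict(X_test)))
```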

# Model Concepts

## Hyperparameters

When we instantiate an estimator, we can pass keyword arguments that will dictate its structure. These arguments are called **hyperparameters**. For example, when we defined our decision tree estimator, we chose how many layers the tree would have using the `max_depth` keyword. This is in contrast to **parameters**, which are the numbers that our model uses to make predictions based on features. Parameters are optimized during the training process: they keep changing as the algorithm searches for the values that best fit the data. Hyperparameter values, on the other hand, are set before training begins and do not change during the training process. Pretty much all models have hyperparameters. Even a simple linear regressor has a hyperparameter: `fit_intercept`. Here are some common examples of hyperparameters:

  • The imputation strategy used for missing data.
  • The number of trees in a random forest model.
  • The number of jobs to run in parallel when fitting and predicting.
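
For instance, with scikit-learn estimators, hyperparameters are passed at instantiation and can be read back with `get_params`, while fitted parameters only exist after training:

```python
from sklearn.tree import DecisionTreeRegressor

# max_depth is a hyperparameter: fixed before training, never changed by it
tree = DecisionTreeRegressor(max_depth=3)
print(tree.get_params()["max_depth"])  # 3

# Parameters, by contrast, appear only after training:
# the fitted tree structure doesn't exist yet
print(hasattr(tree, "tree_"))  # False
```
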
# References and Further Reading

- [Parameters and Hyperparameters in Machine Learning and Deep Learning](https://towardsdatascience.com/parameters-and-hyperparameters-aa609601a9ac) 

---

Copyright 2022 WorldQuant University. This content is licensed solely for personal use. Redistribution or publication of this material is strictly prohibited.


<font size="+3"><strong>Machine Learning: Unsupervised Learning</strong></font>

Machine Learning falls into two classes, **supervised** learning and **unsupervised** learning. In supervised learning, we're trying to learn a predictive relationship between **features** of our data and some sort of  **target**. In unsupervised learning, we want to find trends in our features without using any target. 

A human example of supervised learning would be borrowing books from a library on mathematics and geography. By reading different books belonging to each topic, we learn what symbols, images, and words are associated with math, and which are associated with geography. A similar unsupervised task would be to borrow many books without knowing their subject. We can see some books contain similar images (maps) and some books contain similar symbols (e.g., the Greek letters Σ and π). We say the books containing maps are similar and that they are different from the books containing Greek letters. Crucially, we do not know what the books are about, only that they are similar or different.

# Clustering

Clustering is a branch of unsupervised machine learning where the goal is to identify groups, or clusters, in your data set without the use of labels. Clustering should not be confused with classification: we're not trying to make predictions about observations from a set of pre-defined classes. In clustering, you're identifying a set of similar data points and calling the resulting set a cluster.

Let's consider an example of clustering. You may have a data set characterizing your customers, with information like demographics and personal preferences. A supervised machine learning task would be to predict which class a person belongs to: customers who will purchase your product and customers who won't. In contrast, an unsupervised machine learning task would be to identify several groups or types of customers. With these groups identified, you can analyze them and build profiles describing them. For example, one group might tend to include people from the ages of 20 to 25 who like the outdoors. With these profiles, you can pass that information and analysis to the marketing team to create different advertisements to best attract each group.

## k-means Clustering

The k-means algorithm seeks to find $K$ clusters within a data set. The clusters are chosen to reduce the distance between the data points within each cluster and the cluster's center. The objective function is

$$\min_{C_k} \sum_{k} \sum_{X_j \in C_k} \lVert X_j - \mu_k \rVert^2,$$

where $\mu_k$ is defined as the centroid of a cluster. Each cluster has a unique $\mu_k$, which equals the mean of each feature/component over all the points assigned to the cluster. We use the following equation to calculate $\mu_k$:

$$\mu_k = \frac{1}{|C_k|} \sum_{X_j \in C_k} X_j,$$

where $|C_k|$ is the number of points in cluster $k$. The training algorithm for k-means is iterative. First, we assign $k$ random starting locations as each cluster's centroid, then we:

  1. Assign each point to a cluster based on which cluster centroid it's closest to
  2. Calculate a new centroid for each cluster based on its assigned data points
  3. Repeat the above steps until convergence
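
The three steps above can be sketched in plain NumPy. The two blobs of points are synthetic, and a fixed number of iterations stands in for a formal convergence check:

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic data: two well-separated blobs of 2-D points
X = np.concatenate([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])

k = 2
# Random starting locations: k distinct data points serve as initial centroids
centroids = X[rng.choice(len(X), size=k, replace=False)]

for _ in range(10):
    # Step 1: assign each point to the cluster with the closest centroid
    distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = distances.argmin(axis=1)
    # Step 2: recompute each centroid as the mean of its assigned points
    centroids = np.array([X[labels == c].mean(axis=0) for c in range(k)])
    # Step 3: repeat (here, for a fixed number of iterations)

print(np.round(np.sort(centroids[:, 0])))  # centroids land near the two blobs
```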

To see how the algorithm works in practice, let's quickly go through this example below:

[1]:
df = wrangle("data/SCFP2019-textbook.csv.gz")
(1128, 14)
[1]:
HHSEX AGE EDUC MARRIED KIDS INCOME WAGEINC TURNFEAR STOCKS HOUSES DEBT NETWORTH NFIN ASSET
0 1 50 8 1 3 38688.480260 11199.296917 1 0 0.0 12200.0 -6710.0 3900.0 5490.0
1 1 50 8 1 3 37670.362358 11199.296917 1 0 0.0 12600.0 -4710.0 6300.0 7890.0
2 1 50 8 1 3 38688.480260 12217.414819 1 0 0.0 14100.0 -2510.0 10000.0 11590.0
3 1 50 8 1 3 38688.480260 12217.414819 1 0 0.0 15400.0 -5715.0 8100.0 9685.0
21 1 45 2 1 0 38688.480260 30543.537047 1 0 0.0 0.0 9860.0 9800.0 9860.0
In the following example, we'll use `"INCOME"` and `"HOUSES"` to demonstrate the k-means clustering algorithm. First, we select the two features from our DataFrame as the training set.

[2]:
X = df[["INCOME", "HOUSES"]]
(1128, 2)
[2]:
INCOME HOUSES
0 38688.480260 0.0
1 37670.362358 0.0
2 38688.480260 0.0
3 38688.480260 0.0
21 38688.480260 0.0
Next, we apply the k-means clustering algorithm from `sklearn`. We can choose the number of clusters by defining `n_clusters`. In this example, we'll show the results with 3 clusters. Note that the algorithm starts with randomly assigned initial centroid positions, so defining a `random_state` will make sure the randomly assigned positions stay the same across repeated runs.

[3]:
model = KMeans(n_clusters=3, random_state=42)
[3]:
KMeans(n_clusters=3, random_state=42)
After fitting the data, the model will assign each data point a `label`, indicating which cluster this data point belongs to.

[4]:
labels = model.labels_
[4]:
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=int32)
To visualize the clustering results, we can quickly draw a scatter plot showing the two features. We can use different colors for different clusters.

[5]:
sns.scatterplot(
    x=df["INCOME"] / 1e6, y=df["HOUSES"] / 1e6, hue=model.labels_, palette="deep"
)
From the scatter plot, we can see the k-means algorithm divides data points mostly based on the `"HOUSES"` feature. The lower home value data points are assigned to cluster 0, higher home value data points are assigned to cluster 1, while cluster 2 shows data points with home values in the middle.

As mentioned when describing the k-means algorithm, centroids play a very important role in deciding the clustering results. We can extract the final centroid locations for each cluster and plot them on the scatter plot.

[6]:
centroids = model.cluster_centers_
[6]:
array([[ 41008.9676637 ,    912.95116773],
       [ 42475.16438463, 127842.10526316],
       [ 41537.63190662,  70697.6744186 ]])
[7]:
sns.scatterplot(
    x=df["INCOME"] / 1e6, y=df["HOUSES"] / 1e6, hue=model.labels_, palette="deep"
)
# plot the centroids on top of the clusters
plt.scatter(centroids[:, 0] / 1e6, centroids[:, 1] / 1e6, marker="*", s=200, c="k")
## Clustering Metrics

Clustering Metrics¶

To see whether our clustering algorithm performs well, we need more than a scatter plot. The two common metrics we use are inertia and silhouette score. These metrics will also be helpful in determining the number of clusters to use.

### Inertia

**Inertia** is the within-cluster sum of squared distances, which is used in the k-means algorithm's objective function. Mathematically, inertia is equal to

$$\sum_{k} \sum_{X_j \in C_k} \lVert X_j - \mu_k \rVert^2,$$

where $\mu_k$ is the centroid of cluster $k$ and $C_k$ is the set of points assigned to cluster $k$. Basically, the inertia is the sum of the distances of each point to the centroid of its assigned cluster. A lower inertia means the points assigned to the clusters are closer to their centroids.

We can extract `inertia` from the previous model using the code below:

[8]:
inertia = model.inertia_
print("Inertia (3 clusters):", inertia)
Inertia (3 clusters): 213568429223.46463
### Silhouette Score

**Silhouette coefficient** is a measure of how dense and well-separated the clusters are. The silhouette coefficient is a property assigned to each data point. It's equal to

$$\frac{b - a}{\max(a, b)},$$

where $a$ is the distance between a point and the centroid of its assigned cluster, and $b$ is the distance between the point and the centroid of the nearest neighboring cluster (i.e., the closest cluster the point is not assigned to).

The silhouette coefficient ranges from -1 to 1. If a point is really close to the centroid of its assigned cluster, then $a \ll b$ and the silhouette coefficient will be approximately equal to 1. If the reverse is true, $a \gg b$, then the coefficient will be close to -1. If the point could have been assigned to either cluster, its coefficient will be 0.

A higher silhouette coefficient means denser and more clearly separated clusters. This is because we want a lower $a$ (close to the assigned cluster's centroid) and a higher $b$ (far away from the neighboring cluster's centroid). A lower $a$ combined with a higher $b$ produces a higher silhouette score.
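
Using the centroid-based definition above, the coefficient for a single point can be computed directly. The centroids and the point here are invented for illustration:

```python
import numpy as np

centroid_own = np.array([0.0, 0.0])    # centroid of the point's assigned cluster
centroid_near = np.array([10.0, 0.0])  # centroid of the nearest other cluster
point = np.array([1.0, 0.0])

a = np.linalg.norm(point - centroid_own)   # close to its own centroid
b = np.linalg.norm(point - centroid_near)  # far from the neighboring one

silhouette = (b - a) / max(a, b)
print(silhouette)  # close to 1, since a is much smaller than b
```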

We can calculate the silhouette score using the code below:

[9]:
ss = silhouette_score(X, model.labels_)
print("Silhouette Score (3 clusters):", ss)
Silhouette Score (3 clusters): 0.7519665901248906
## Optimizing the Number of Clusters

We can choose the optimal number of clusters by examining how the number of clusters affects inertia and silhouette score. The following plots show how inertia and silhouette score change as the number of clusters ranges from 2 to 20.

[10]:
inertias = []
silhouette_scores = []
for k in range(2, 21):
    model = KMeans(n_clusters=k, random_state=42)
    model.fit(X)
    inertias.append(model.inertia_)
    silhouette_scores.append(silhouette_score(X, model.labels_))
Now that we have saved the metrics, we can plot them.

[11]:
plt.title("k-means Model: Inertia vs Number of Clusters");
Note that inertia decreases whenever the number of clusters increases. In fact, inertia goes to zero if the number of clusters equals the number of data points, because each data point would be its own cluster. But that wouldn't make any practical sense, so we're not looking for the minimum point on the graph. Instead, we want the point beyond which increasing the number of clusters doesn't decrease inertia very much anymore. We usually refer to the point that marks this change in how quickly inertia decreases as the **elbow point**. In the example here, the elbow point is at 4-5.

We can also plot the silhouette scores:

[12]:
plt.title("k-means Model: Silhouette Score vs Number of Clusters");
From the graph, we can see the silhouette score drops a lot from 2 to 4 clusters, and again from 5 to 6. Note that a higher silhouette score means denser and more clearly separated clusters, which is what we want. Combining the inertia plot and the silhouette score plot, we can see the optimal number of clusters is 5.

<font size="+1">Practice</font> 

Use `"ASSET"` and `"INCOME"` from the same data set `"SCFP2019-textbook.csv.gz"`, and:

  1. Use k-means to assign the data points to 3 clusters.
  2. Create a scatter plot showing the resulting clusters and their centroids.
  3. Calculate the inertia and the silhouette score.
[13]:
    df = pd.read_csv(filepath)
[14]:
# Plot "ASSET" vs "INCOME" with hue=label
[15]:
print("Inertia (3 clusters):", inertia)
Inertia (3 clusters): Ellipsis
[16]:
print("Silhouette Score (3 clusters):", ss)
Silhouette Score (3 clusters): Ellipsis
# Principal Component Analysis

Principal component analysis (PCA) is a dimension reduction technique that takes a data set characterized by a set of possibly correlated features and generates a new set of features that are uncorrelated. It is used as a dimension reduction technique because the new set of uncorrelated features are chosen to be efficient in terms of capturing the variance in the data set.

Let's examine a case where we have a data set of only two dimensions. In practice, PCA is rarely used when the dimension of the data set is already low. However, it is easier to illustrate the method when we have two or three dimensions.

[17]:
x2 = 2 * x1 + 1 + 0.2 * np.random.randn(500)
The data plotted is characterized by two dimensions, but most of the variation does not occur in either of the two dimensions. Most of the points "follow" along the direction plotted in the dashed line. The variables $x_1$ and $x_2$ are highly correlated; as $x_1$ increases, in general, so does $x_2$ and vice versa.

Instead of using the original two features, $x_1$ and $x_2$, perhaps we can use a different set of features, $\xi_1$ and $\xi_2$. The first chosen feature, $\xi_1$, should be aligned in the direction of greatest variation, while the second will be orthogonal to the first. The new axes/dimensions are referred to as principal components. Let's visualize the data set using the principal components $\xi_1$ and $\xi_2$ rather than the original features.

[18]:
plt.hlines(0, xi_1_min, xi_1_max, linestyles="--")
In the figure, we can clearly observe that $\xi_1$ is the dimension with the largest variation. In the PCA algorithm, $\xi_1$ is chosen to capture as much of the variation as possible, with $\xi_2$ picking up the remaining variation. Now, if we want to use one dimension to describe our data, we would keep $\xi_1$ and drop $\xi_2$, ensuring we keep as much of the information in our data set as possible using just one dimension. Further, notice how the new dimensions are not correlated. As we move from lower to higher values of $\xi_1$, $\xi_2$ does not predictably increase or decrease.

## PCA in `scikit-learn`

In scikit-learn, dimension reduction algorithms are transformers. Having these algorithms as transformers makes sense, since they apply a transformation to the data set. Let's illustrate the syntax for the PCA algorithm in scikit-learn. For most of these algorithms, the data needs to be centered and scaled to work properly. `PCA` automatically centers the data but does not scale it, so `StandardScaler` is often used to preprocess the data before applying PCA.

[19]:
print("Number of dimensions before reduction:", X_scaled.shape[-1])
Number of dimensions before reduction: 6
Number of dimensions after reduction: 2
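Putting the whole workflow together, here's a minimal, self-contained sketch of scaling followed by PCA; the synthetic array `X` stands in for the lesson's actual data set, which isn't shown here:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 6 columns that are all combinations of 2 underlying signals
rng = np.random.default_rng(42)
base = rng.normal(size=(100, 2))
X = np.hstack([base, base @ rng.normal(size=(2, 4))])

# Scale first (PCA centers the data but does not scale it), then reduce
X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

print("Number of dimensions before reduction:", X_scaled.shape[-1])
print("Number of dimensions after reduction:", X_pca.shape[-1])
```

Because the six columns really carry only two independent signals, the first two principal components capture essentially all of the variance; `pca.explained_variance_ratio_` confirms this.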
---

Copyright 2022 WorldQuant University. This content is licensed solely for personal use. Redistribution or publication of this material is strictly prohibited.


Usage Guidelines

This lesson is part of the DS Lab core curriculum. For that reason, this notebook can only be used on your WQU virtual machine.

This means:

  • ⓧ No downloading this notebook.
  • ⓧ No re-sharing of this notebook with friends or colleagues.
  • ⓧ No downloading the embedded videos in this notebook.
  • ⓧ No re-sharing embedded videos with friends or colleagues.
  • ⓧ No adding this notebook to public or private repositories.
  • ⓧ No uploading this notebook (or screenshots of it) to other websites, including websites for study resources.

<font size="+3"><strong>Time Series: Statistical Models</strong></font>

# Autoregression

Autoregression (AR) is a time series model that uses observations from previous time steps as input to a regression equation that predicts the value at the next time step. AR works similarly to autocorrelation: in both cases, we're comparing data from one part of a series to another part. In effect, an AR model regresses a series on its own past values.
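Written out, an AR model of order $p$ is just a linear regression of the series on its own last $p$ values; this is the standard textbook notation rather than anything specific to our data:

$$y_t = c + \phi_1 y_{t-1} + \phi_2 y_{t-2} + \cdots + \phi_p y_{t-p} + \epsilon_t$$

Here $c$ is a constant, each $\phi_i$ is a learned coefficient for the value $i$ steps back, and $\epsilon_t$ is random noise at time $t$.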

## Cleaning the Data

Just like with linear regression, we'll start by bringing in some tools to help us along the way.

[1]:
 
import warnings
​
import matplotlib.pyplot as plt
import pandas as pd
import plotly.express as px
from arch import arch_model
from IPython.display import YouTubeVideo
from pymongo import MongoClient
from sklearn.metrics import mean_absolute_error
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.tsa.ar_model import AutoReg
​
warnings.simplefilter(action="ignore", category=FutureWarning)
Since we'll be working with the `"air-quality"` data again, we need to connect to the server, start our client, and grab the data we need.

[2]:
 
client = MongoClient(host="localhost", port=27017)
db = client["air-quality"]
lagos = db["lagos"]
<font size="+1">Practice</font>

Just to make sure we're all on the same page, import all those libraries and get your database up and running. Remember that even though all the examples use the Site 3 data from the `lagos` collection, the practice sets should use Site 4 data from the `lagos` collection. Call your collection `lagos_prac`.

[3]:
 
lagos_prac = ...
In order to get our data into a form we can use to build our model, we're going to need to transform it in several key ways. The first thing we need to do is to get the data we need, and save the results in a DataFrame. Since we're interested in predicting the changes in air quality over time, let's set the DataFrame's index to `"timestamp"`:

[4]:
 
results = lagos.find(
    # Note that the `3` refers to Site 3.
    {"metadata.site": 3, "metadata.measurement": "P2"},
    projection={"P2": 1, "timestamp": 1, "_id": 0},
)
df = pd.DataFrame(list(results)).set_index("timestamp")
<font size="+1">Practice</font>

Try it yourself! Create a query result called `results_prac` that pulls data from Site 4 in the `lagos` collection, then save it in a DataFrame called `df_prac` with the index `"timestamp"`.

[ ]:
 
​
​
## Localizing the Timezone

Because MongoDB stores all timestamps in UTC, we need a way to localize them. Having timestamps in UTC might be useful if we were trying to predict some kind of global trend, but since we're only interested in what's happening with the air in Lagos, we need to change the data from UTC to `Africa/Lagos`. Happily, pandas has a pair of tools to help us out: `tz_localize` and `tz_convert`. We use those methods to transform our data like this:

[5]:
 
df.index = df.index.tz_localize("UTC").tz_convert("Africa/Lagos")
## Resampling Data

The most important kind of data in our time series model is the data that deals with time. Our `"timestamp"` data tells us when each reading was taken, but in order to create a good predictive model, we need the readings to happen at regular intervals. Our data doesn't do that, so we need to change it so that it does. The `resample` method does that for us.

Let's resample our data to create 1-hour reading intervals by aggregating using the mean:

[6]:
 
# `"1H"` represents our one-hour window
df = df["P2"].resample("1H").mean().fillna(method="ffill").to_frame()
Notice the second half of the code:

fillna(method="ffill").to_frame()

That tells the model to forward-fill any empty cells with imputed data. Forward-filling means that each empty cell is filled with the most recent earlier cell that actually has data in it. This helps to keep the imputed data in line with the rest of the dataset.
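To see forward-filling in action on something small, here's a toy Series with made-up values; `ffill` is the modern shorthand for the `fillna(method="ffill")` call above:

```python
import numpy as np
import pandas as pd

s = pd.Series([5.0, np.nan, np.nan, 8.0])
# Each missing value is replaced by the most recent value before it
print(s.ffill().tolist())  # [5.0, 5.0, 5.0, 8.0]
```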

## Adding a Lag

We've spent some time elsewhere thinking about how two sets of data — apartment price and location, for example — compare to each other, but we haven't had any reason to consider how a dataset might compare to itself. If we're predicting the future, we want to know how good our prediction will be, so it might be useful to build some of that accountability into our model. To do that, we need to add a lag.

Lagging data means that we're adding a delay. In this case, we're going to allow the model to test itself out by comparing its predictions with what actually happened an hour before. If the prediction and the reality are close, then it's a good model; if they aren't, then the model isn't a very good one.

So, let's add a one-hour lag to our dataset:

[7]:
 
# In `shift(1)`, the `1` is the lag interval.
df["P2.L1"] = df["P2"].shift(1)
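To make the lag concrete, here's what `shift(1)` does to a toy Series (made-up values, not the sensor data):

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0])
lagged = s.shift(1)  # each row now holds the previous row's value
print(lagged.tolist())  # [nan, 1.0, 2.0, 3.0]
```

The first entry becomes `NaN` because there's no earlier value to shift into it, which is exactly why we drop null values in the next step.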
Finally, let's drop our null values:

[8]:
 
df.dropna(inplace=True)
y = df["P2"].resample("1H").mean().fillna(method="ffill")
<font size="+1">Practice</font>

Try it yourself! Clean the Site 2 data from lagos, and save it as a Series called y_prac.

[9]:
 
df_prac.index = ...
df_prac = ...
df_prac["P2.L1"] = ...
​
​
y_prac = ...
## Exploring the Data

### Time Series Line Plot

### Creating an ACF Plot

Let's make an ACF plot using our y Series.

[ ]:
 
fig1, ax = plt.subplots(figsize=(15, 6))
# This is where to include your Series
plot_acf(y, ax=ax)
plt.xlabel("Lag [hours]")
plt.ylabel("Correlation Coefficient");
Each of the dots on our plot represents a correlation coefficient. The first data point in the top left of the graph tells us that at time-step 0, the correlation coefficient was 1, meaning that there was a perfect correlation. That makes sense, because you can't lag from time-step 0, so the coefficient can't be anything other than 1. But, starting at hour 1, the coefficient drops precipitously, and we see our autocorrelation coefficients slowly decay over time. As our lag recedes further into the past, the correlations break down; a prediction you made five hours ago about what's happening right now is going to be a lot more reliable than a prediction you made 96 hours ago.

The light blue shape across the bottom of the graph represents the confidence interval, or the extent to which we can be sure that our estimated correlations reflect the correlations that exist in reality. By default, this is set to 95%. Data points which fall either above or below the shape are likely not due to chance, and those which fall inside the shape are likely due to chance. It looks like all our data is the result of some kind of effect, so we're good to go.

<font size="+1">Practice</font>

Try it yourself! Make an ACF plot called fig2 using your y_prac Series.

[ ]:
 
fig2, ax = ...
​
​
### Creating a PACF Plot

Let's make a PACF plot using our y Series.

[ ]:
 
fig1, ax = plt.subplots(figsize=(15, 6))
plot_pacf(y, ax=ax)
plt.xlabel("Lag [hours]")
plt.ylabel("Correlation Coefficient");
Aha! This looks very different. There are two things to notice here:

First, we now have lots of data points that we can be relatively certain aren't due to chance, but we also have lots of data points inside the blue shape at the bottom, indicating that some of our data points are indeed due to chance. That's not necessarily a problem, but it's something useful to keep in mind.

Second, recognize that even though the amplitude of the points on our graph has been significantly reduced, the trend has remained essentially the same: Strong positive correlations at the beginning, with the effect decaying over time. We would expect to see that, because the farther out into the future our predictions go, the less accurate they become.

<font size="+1">Practice</font>

Try it yourself! Make a PACF plot using your `y_prac` Series.

[ ]:
 
fig2, ax = ...
​
​
## Working with Rolling Windows

A rolling window is an important concept in time series analysis. We first define a window size, such as seven days or three months. Then we calculate some statistic over the data in each window, moving sequentially through the time series. For example, suppose we want to calculate a three-month rolling sum with the time series data below:

| Month   | Sales |
|---------|-------|
| 2022-01 | 10    |
| 2022-02 | 20    |
| 2022-03 | 25    |
| 2022-04 | 15    |
| 2022-05 | 20    |
| 2022-06 | 30    |

The three-month rolling sum would be

| Rolling months | Rolling sum of sales |
|----------------|----------------------|
| 2022-01,02,03  | 55                   |
| 2022-02,03,04  | 60                   |
| 2022-03,04,05  | 60                   |
| 2022-04,05,06  | 65                   |
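We can reproduce this small table with pandas' `rolling` method (the toy sales numbers are re-created here for illustration):

```python
import pandas as pd

sales = pd.Series(
    [10, 20, 25, 15, 20, 30],
    index=pd.period_range("2022-01", periods=6, freq="M"),
    name="sales",
)
# A window of 3 sums each month together with the two months before it
rolling_sum = sales.rolling(3).sum()
print(rolling_sum.dropna().tolist())  # [55.0, 60.0, 60.0, 65.0]
```

The first two windows are incomplete, so their sums come out as `NaN` and are dropped here.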
Rolling window statistics are very helpful in smoothing noisy data when making time series predictions. Let's see it with an example. Since we're interested in making predictions about the air quality in Lagos, it would be helpful to understand the rolling average for the PM 2.5 readings with a line plot. To keep things manageable, we'll set our window-size to one week.

[ ]:
 
fig, ax = plt.subplots(figsize=(15, 6))
# `168` is the number of hours in a week.
df["P2"].rolling(168).mean().plot(ax=ax);
Even though there are lots of peaks and valleys here, we're starting to see an emerging trend.

<font size="+1">Practice</font>

Try it yourself! Make a line plot that shows the weekly rolling average of the P2 values in the Site 2 dataset.

[ ]:
 
​
Besides the rolling sum and rolling average, rolling window calculations can be applied to many other statistics, depending on the problem you are facing. In the example below, when we use a **GARCH** model to analyze stock prices, we can use a rolling window to calculate the standard deviation.

### Splitting the Data in pandas

The last thing to do in our data exploration is to split our data into training and test sets. For linear regression, we used an 80/20 split, where 80% of the data was our training set and 20% was our test set. This time, we're going to expand the training set to 95% and decrease the test set to 5%, to bring it into line with the statsmodels default confidence interval. This is important, because we'll need to use as much training data as we can if our model is going to accurately predict what's going to come next.

[ ]:
 
cutoff_test1 = int(len(y) * 0.95)
​
y_train = y.iloc[:cutoff_test1]
y_test = y.iloc[cutoff_test1:]
<font size="+1">Practice</font>

Try it yourself! Create a cutoff called `cutoff_test2`, and split the `y_prac` Series into training and test sets, making sure to set the cutoff to 0.95.

[ ]:
 
cutoff_test2 = ...
​
y_prac_train = ...
y_prac_test = ...
## Building the Model

### Baseline

First, let's calculate the baseline MAE for our model.

[ ]:
 
y_train_mean = y_train.mean()
y_pred_baseline = [y_train_mean] * len(y_train)
mae_baseline = mean_absolute_error(y_train, y_pred_baseline)
​
print("Mean P2 Reading:", round(y_train_mean, 2))
print("Baseline MAE:", round(mae_baseline, 2))
<font size="+1">Practice</font>

Try it yourself! Calculate the baseline mean and MAE for the y_prac Series.

[ ]:
 
y_prac_train_mean = ...
y_prac_pred_baseline = ...
mae_baseline_prac = ...
​
### Iterating

Before we can go any further, we need to instantiate an autoregression model based on our `y` training data. We'll call the model `model`.

[ ]:
 
model = AutoReg(y_train, lags=24, old_names=False).fit()
Notice that, unlike our linear regression model which we built using scikit-learn, we're combining instantiation and fitting into one step; statsmodels includes that ability in its `AutoReg` method.

<font size="+1">Practice</font>

Try it yourself! Create and fit an autoregression model called model_prac.

[ ]:
 
model_prac = ...
Autoregression models need us to generate **in-sample predictions** in order to calculate the MAE of our training data. In-sample predictions use data that's already part of our sample. That's to distinguish it from out-of-sample predictions, which we'll talk about a bit later. The statsmodels library includes a method called [`predict`](https://www.statsmodels.org/stable/examples/notebooks/generated/predict.html) that can help us here. Above, the `AutoReg` method includes this line:

old_names=False

The `old_names=False` argument is just housekeeping: it tells statsmodels to use its newer parameter-naming convention, which avoids a deprecation warning. It has no effect on how the model makes its predictions.

Here's how to generate in-sample predictions:

[ ]:
 
y_pred = model.predict().dropna()
Once we've done that, we can calculate the MAE of the predictions in our training set.

[ ]:
 
training_mae = mean_absolute_error(y_train.loc[y_pred.index], y_pred)
print("Training MAE:", training_mae)
<font size="+1">Practice</font>

Try it yourself! Generate in-sample predictions using y_prac, and find the MAE for your y_prac training data. Print the result.

[ ]:
 
y_prac_pred = ...
training_mae_prac = mean_absolute_error(
    y_prac_train.loc[y_prac_pred.index], y_prac_pred
)
​
### Residuals

We're going to use our model's residuals to make some visualizations, but first, we need to calculate what those residuals are.

[ ]:
 
y_train_resid = y_train - y_pred
Now we can make a line plot of our model's residuals.

[ ]:
 
fig1, ax = plt.subplots(figsize=(15, 6))
y_train_resid.plot(ax=ax);
The ideal residual plot has a random set of datapoints spread evenly on both sides of the line. The plot we just made actually looks pretty good; there are some significant outliers, but, on the whole, the residuals describe an even band of values, which is what we're looking for.

<font size="+1">Practice</font>

Try it yourself! Calculate the residuals for y_prac and visualize them on a line plot called fig2.

[ ]:
 
y_prac_train_resid = ...
fig2, ax = ...
​
Let's also take a look at a histogram of the residuals to help us see how they're distributed.

[ ]:
 
y_train_resid.hist();
Remember, when we make histograms, we're trying to answer two questions: 

1. Is it a normal distribution?
2. Are there any outliers?

For our histogram, that middle bar is pretty tall, but the shape described by all the bars looks like a normal distribution (albeit a stretched one), so the answer to the first question is "yes." Outliers are values that fall beyond the shape of a normal distribution, and it doesn't look like we have any of those, so the answer to the second question is "no." Those are the answers we're looking for, so let's move on to the next step.

### ACF Plots

We're going to make an ACF plot to see how much variation there is in the dataset.

[ ]:
 
fig, ax = plt.subplots(figsize=(15, 6))
plot_acf(y_train_resid.dropna(), ax=ax);
At first, this might seem wrong, but we're actually looking for a mostly-flat graph here. This is an indication that our model describes all the **seasonality**, or regular changes, in our data. In other words, this graph is exactly what we're looking for.

<font size="+1">Practice</font>

Try it yourself! Make a histogram and an ACF plot of the `y_prac` data.

[ ]:
 
​
[ ]:
 
​
## Evaluating the Model

Now that we've built an autoregression model that seems to be working pretty well, it's time to evaluate it. We've already established that the model works well when compared to itself, but what about how well it works when we start looking outside our original dataset?

### Out-of-Sample Predictions

To look outside the data, we need to create a new set of predictions. The process here is very similar to the way we made our in-sample predictions: we're still using `predict`, but now we pass in the range of the test data instead of the training data.

[ ]:
 
y_pred_test = model.predict(y_test.index.min(), y_test.index.max())
Now that we have a prediction, we can calculate the MAE of our out-of-sample data.

[ ]:
 
test_mae = mean_absolute_error(y_test, y_pred_test)
print("Test MAE 1:", test_mae)
<font size="+1">Practice</font>

Try it yourself! Generate out-of-sample predictions using your y_prac data and model_prac, calculate the MAE, and print the result.

[ ]:
 
y_prac_pred_test = model_prac.predict(
    y_prac_test.index.min(), y_prac_test.index.max()
)
test_mae_prac = ...
​
Now that we have some out-of-sample predictions, we can compare them to the true values using a line plot. The first step there is to create a new DataFrame called `test1_predictions` with two columns: one for the `y_test` data (the true data) and one for the `y_pred_test` data (the predicted data). It's always a good idea to print the first five rows of a new DataFrame, just to make sure it looks the way it should.

[ ]:
 
test1_predictions = pd.DataFrame(
    {"y_test": y_test, "y_pred": y_pred_test}, index=y_test.index
)
test1_predictions.head()
That looks correct, so we can move on to our line plot.

[ ]:
 
fig = px.line(test1_predictions, labels={"value": "P2"})
fig.show()
This looks kind of strange, but it's actually exactly what we would expect to see. At the beginning, the `y_pred` data has a fair amount of predictive power, but, as time goes on, the predictions become less and less accurate. It's kind of like what happened with our ACF plots, only in reverse. Last time, the model lost its predictive power as the lag increased. Here, the model loses its predictive power as the horizon — how far away from the present your predictions are — increases.  But don't worry! We'll fix it in a second.

<font size="+1">Practice</font>

In the meantime, try it yourself! Make a DataFrame with columns for y_prac_test and y_prac_pred, and print the result. Then, make a line plot that shows the relationship between the two variables.

[ ]:
 
​
[ ]:
 
​
### Walk-forward Validation

Our predictions lose power over time because the model gets farther and farther away from its beginning. But what if we could move that beginning forward with the model? That's what walk-forward validation is. In a walk-forward validation, we re-train the model for each new observation in the dataset, dropping the data that's the farthest in the past. Let's say that our prediction for what's going to happen at 12:00 is based on what happened at 11:00, 10:00, and 9:00. When we move forward an hour to predict what's going to happen at 1:00, we only use data from 10:00, 11:00, and 12:00, dropping the data from 9:00 because it's now too far in the past. Let's see how it works.

[ ]:
 
%%capture
# First, we define a walk-forward variable
y_pred_wfv = pd.Series()
# Then, we define a variable that takes into account what's happened in the past
history = y_train.copy()
# The `for` loop tells the model what to do with those variables.
for i in range(len(y_test)):
    # Here's where we generate the actual AR model
    r = AutoReg(history, 24, old_names=False).fit()
    # Now we're using `forecast` to create our next prediction
    next_pred = r.forecast()
    # We're adding the next prediction to the list
    y_pred_wfv = y_pred_wfv.append(next_pred)
    # And finally updating `history` to take into account the new observation
    history = history.append(y_test[next_pred.index])
You'll notice that we're using the same `AutoReg` method we used before, only this time we're re-fitting it on `history`, which starts as a copy of `y_train` and grows with each new observation. Also like before, the `24` is telling the model how many hours it should pay attention to. If you change that number, the MAE will change too.

Speaking of the MAE, let's calculate it and see what we've got.

[ ]:
 
test1_mae = mean_absolute_error(y_test, y_pred_wfv)
print("Test MAE 1 (walk forward validation):", round(test1_mae, 2))
<font size="+1">Practice</font>

Try it yourself! Perform a walk-forward validation of your model using the y_prac_train data. Then, calculate the MAE and print the result. Note that because we're using %%capture in the validation cell, you'll need to create a new cell for your MAE calculation.

[ ]:
 
%%capture
​
y_prac_pred_wfv = ...
history_prac = ...
​
[ ]:
 
test2_mae = ...
​
## Communicating the Results

In machine learning, the model's parameters are the parts of the model that are learned from the training data. There are also hyperparameters, which we'll discuss in the next module. For now, just know that parameters come from inside the model, and hyperparameters are specified outside the model.

So, let's print the parameters of our validated model and see what it looks like.

[ ]:
 
print(model.params)
That looks pretty good, but seeing how the walk-forward predictions track the test data in a line plot would be much better.

[ ]:
 
test1_predictions = pd.DataFrame(
    {"y_test": y_test, "y_pred": y_pred_wfv}, index=y_test.index
)
fig = px.line(test1_predictions)
fig.show()
That looks much better! Now our predictions are actually tracking the `test` data, just like they did in the linear regression model.

<font size="+1">Practice</font>

Try it yourself! Access the parameters of model_prac, put y_prac_test and y_prac_pred_wfv into the test2_predictions DataFrame, and create a line plot using plotly express.

[ ]:
 
​
[ ]:
 
​
​
# ARMA Models & Hyperparameters

**ARMA** stands for Auto Regressive Moving Average, and it's a special kind of **time-series** analysis. So far, we've used autoregression (AR) to build our time-series models, and you might recall that AR models rely on values that remain relatively stable over time. That is, they can predict the future very well, as long as the future looks roughly the same as the past. The trouble with predicting the future is that things can suddenly change, and as a result, the future doesn't look much like the past anymore. These sudden changes — economists call them *exogenous shocks* — can be as big as a hurricane destroying a city or an unexpected increase in the minimum wage, and they can be as small as a new restaurant opening in a neighborhood or a single person losing their job. In our data, the air quality might be changed if there was a nearby forest fire, or if a building collapsed near one of the sensors and raised a giant cloud of dust.

Regardless of the size of the shock, ARMA models can still predict the future. All we need to make that work is data.
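Before diving into the real data, it can help to see the two ingredients in miniature. Here's a small numpy simulation of an ARMA(1,1) process; the coefficients `phi` (the AR part) and `theta` (the MA part) are arbitrary illustrative choices, not values estimated from our air-quality data:

```python
import numpy as np

# ARMA(1,1): y_t = phi * y_{t-1} + eps_t + theta * eps_{t-1}
rng = np.random.default_rng(42)
phi, theta = 0.6, 0.3  # hypothetical AR and MA coefficients
eps = rng.normal(size=500)  # the random "shocks"

y = np.zeros(500)
for t in range(1, 500):
    y[t] = phi * y[t - 1] + eps[t] + theta * eps[t - 1]
```

The AR term makes each value depend on the one before it, while the MA term lets the previous shock linger for one extra step instead of vanishing immediately.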

## Cleaning the Data

As always, we need to import all the tools we'll need to make our model.

[ ]:
import time
import warnings

import matplotlib.pyplot as plt
import pandas as pd
import plotly.express as px
import seaborn as sns
from arch import arch_model  # used in the ARCH and GARCH section below
from IPython.display import YouTubeVideo  # used to embed the videos below
from pymongo import MongoClient
from sklearn.metrics import mean_absolute_error
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.tsa.arima.model import ARIMA

warnings.filterwarnings("ignore")
And then we need to get our database client up and running.

[ ]:
client = MongoClient(host="localhost", port=27017)
db = client["air-quality"]
lagos = db["lagos"]
Then, we need to clean our data. All the examples will use data from Site 3; all the practice sets will use Site 2. If you need a refresher on how all those methods work, refer back to the Autoregression notebook.

[ ]:
results = lagos.find(
    # Note that the `3` refers to Site 3.
    {"metadata.site": 3, "metadata.measurement": "P2"},
    projection={"P2": 1, "timestamp": 1, "_id": 0},
)
df = pd.DataFrame(list(results)).set_index("timestamp")
df.index = df.index.tz_localize("UTC").tz_convert("Africa/Lagos")
df = df[df["P2"] < 500]
df["P2.L1"] = df["P2"].shift(1)
df.dropna(inplace=True)
y = df["P2"].resample("1H").mean().fillna(method="ffill")
<font size="+1">Practice</font>

Try it yourself! Get your client up and running and call your database db_prac. Create a variable called results_prac, and read in a collection called lagos_prac using data from Site 2. Save it as a Series called y_prac.

[ ]:
db_prac = ...
lagos_prac = ...

results_prac = ...
df_prac = ...
df_prac.index = ...
df_prac = ...
df_prac["P2.L1"] = ...

y_prac = ...
## Exploring the Data

Just like we did with AR, we'll start by exploring the data. Let's make a histogram.

[ ]:
y.hist();
<font size="+1">Practice</font>

Try it yourself! Make a histogram using y_prac.

[ ]:
This is what the data looks like when our sample is at 1-hour intervals, but we might want to be able to quickly change our sample to other intervals of time. First, we'll put the cleaning code in a function called `wrangle`, and then add an **argument**. In Python, arguments let us pass information into a function to control what it does. This function already has an argument, `lagos` (the collection to read from), so we'll need to add another to make resampling work. We'll call that argument `resamp_pd`.
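As a quick illustration of how default argument values work (the `describe_interval` function here is a made-up example, not part of the lesson's code):

```python
# A hypothetical function with a default argument value.
# If the caller doesn't supply `resamp_pd`, Python falls back to "1H".
def describe_interval(site, resamp_pd="1H"):
    return f"Site {site} resampled at {resamp_pd} intervals"

print(describe_interval(3))        # uses the default "1H"
print(describe_interval(3, "1D"))  # overrides the default
```

This is exactly how `wrangle` will behave: calling it without `resamp_pd` keeps the 1-hour sampling, and passing `"1D"` switches to daily.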

[ ]:
# Here's where the new argument goes. We're setting the default value to `"1H"`.
def wrangle(lagos, resamp_pd="1H"):
    results = lagos.find(
        # Note that the `3` refers to Site 3.
        {"metadata.site": 3, "metadata.measurement": "P2"},
        projection={"P2": 1, "timestamp": 1, "_id": 0},
    )
    df = pd.DataFrame(list(results)).set_index("timestamp")
    df.index = df.index.tz_localize("UTC").tz_convert("Africa/Lagos")
    df = df[df["P2"] < 500]
    df["P2.L1"] = df["P2"].shift(1)
    df.dropna(inplace=True)
    # Here's where the new argument gets used
    y = df["P2"].resample(resamp_pd).mean().fillna(method="ffill")
    return y
Now let's change `"1H"` to `"1D"` and see what happens.

[ ]:
y = wrangle(lagos, resamp_pd="1D")
print(y)
As you can see on the left side of the table, the samples are now at one-day intervals, which is exactly what we wanted!

Let's make a new histogram to see if changing the sampling interval made a difference in the data.

[ ]:
y.hist();
This looks pretty different! Resampling changes the shape of the distribution, because averaging over a whole day smooths out the hourly extremes.

Practice

Try it yourself! Define a function called wrangle_prac, run it, and print the results of y_prac. Then, create a new histogram from y_prac.

[ ]:
print(y_prac)
[ ]:
Like with our AR model, we need to create ACF and PACF plots to see what's happening with the correlation coefficients.

[ ]:
fig, ax = plt.subplots(figsize=(15, 6))
plot_acf(y, ax=ax)
plt.xlabel("Lag [hours]")
plt.ylabel("Correlation Coefficient");
And now let's make a PACF plot.

[ ]:
fig, ax = plt.subplots(figsize=(15, 6))
plot_pacf(y, ax=ax)
plt.xlabel("Lag [hours]")
plt.ylabel("Correlation Coefficient");
<font size="+1">Practice</font>

Try it yourself! Make an ACF and a PACF plot using your y_prac data.

[ ]:
[ ]:
## Splitting the Data

In our AR model, we split our data based on the number of observations we wanted to investigate. This time, we're going to split our data based on the date, using just the readings from October 2018. So, just like we did before, we'll create a training set using y, but instead of using percentages to split the data, we'll use dates.

[ ]:
# Notice that the date format is `YYYY-MM-DD`
y_train = y.loc["2018-10-01":"2018-10-31"]
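Here's a minimal sketch of how pandas slices a time series by date labels with `.loc`, using a toy three-day Series rather than our air-quality data. Unlike positional slicing, label-based slicing includes both endpoints:

```python
import pandas as pd

# Toy Series with a DatetimeIndex (made-up values)
s = pd.Series(
    [1.0, 2.0, 3.0],
    index=pd.to_datetime(["2018-09-30", "2018-10-01", "2018-10-31"]),
)

# Label-based slicing keeps both October endpoints
october = s.loc["2018-10-01":"2018-10-31"]
print(len(october))  # 2 -- the September reading is excluded
```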
<font size="+1">Practice</font>

Try it yourself! Create a training dataset called y_prac_train based on November 2017. (Hint: there are 30 days in November.)

[ ]:
y_prac_train = ...
## Building the Model

### Baseline

The first thing we need to do is establish a baseline: we'll predict the mean of the training data and calculate the MAE of that naive prediction:

[ ]:
y_train_mean = y_train.mean()
y_pred_baseline = [y_train_mean] * len(y_train)
mae_baseline = mean_absolute_error(y_train, y_pred_baseline)
print("Mean P2 Reading:", round(y_train_mean, 2))
print("Baseline MAE:", round(mae_baseline, 2))
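To see what the baseline MAE actually measures, here's the same calculation done by hand on a toy list of readings (the numbers are made up for illustration):

```python
# Toy P2 readings (hypothetical values)
readings = [10.0, 20.0, 30.0, 40.0]

# The baseline prediction is just the mean of the training data
mean_reading = sum(readings) / len(readings)  # 25.0

# MAE: average absolute difference between each reading and the prediction
mae = sum(abs(r - mean_reading) for r in readings) / len(readings)
print(mean_reading, mae)  # 25.0 10.0
```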
<font size="+1">Practice</font>

Try it yourself! Calculate the mean and baseline MAE for the y_prac_train Series, and print the results.

[ ]:
y_prac_train_mean = ...
y_prac_pred_baseline = ...
mae_baseline_prac = ...
​
### Iterating

So far, the only difference between our old AR model and the new ARMA model we're building is that the new model's data is split based on the date rather than on the length of the variable. The real difference between AR and ARMA is the addition of a moving average term, which brings new hyperparameters to tune.

### Hyperparameters

Let's set our `p` values to include values from 0 to 24, moving in steps of 8:

[ ]:
p_params = range(0, 25, 8)
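To check exactly which `p` values this grid will try, we can expand the range:

```python
p_params = range(0, 25, 8)
print(list(p_params))  # [0, 8, 16, 24]
```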
And let's set our `q` values to include values from 0 to 2, moving in steps of 1:

[ ]:
q_params = range(0, 3)
<font size="+1">Practice</font>

Using `p_params_prac`, set the `p` values to include values from 1 to 4, moving in steps of 1. Then, using `q_params_prac`, set the `q` values to include values from 0 to 3, moving in steps of 1.

[ ]:
p_params_prac = ...
q_params_prac = ...
In order to tell the model to keep going through all the possible combinations, we'll add in a pair of `for` loops. (If you need a refresher on `for` loops, refer to Notebook 001.)

[ ]:
maes = dict()
for p in p_params:
    maes[p] = list()
    for q in q_params:
        order = (p, 0, q)
        start_time = time.time()
        # Here's where we actually define and train the model
        model = ARIMA(y_train, order=order).fit()
        # Here's where we time how long training took
        elapsed_time = round(time.time() - start_time, 2)
        print(f"Trained ARIMA {order} in {elapsed_time} seconds")
        # Here's where we calculate the MAE for the model
        y_pred = model.predict()
        mae = mean_absolute_error(y_train.iloc[24:], y_pred.iloc[24:])
        # And finally we append the MAE to the list for this `p`
        maes[p].append(mae)
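The nested loops above try every `(p, q)` combination, the same grid that `itertools.product` would generate; here's a quick sketch of how many models end up being trained:

```python
from itertools import product

p_params = range(0, 25, 8)  # 0, 8, 16, 24
q_params = range(0, 3)      # 0, 1, 2

# One ARIMA order tuple per (p, q) pair, with d fixed at 0
orders = [(p, 0, q) for p, q in product(p_params, q_params)]
print(len(orders))  # 12 models in total
print(orders[:3])   # [(0, 0, 0), (0, 0, 1), (0, 0, 2)]
```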
<font size="+1">Practice</font>

Try it yourself! Create an ARMA model called model_prac using your p_params_prac and q_params_prac grids and your training data. Print the training results, and append each MAE to a dictionary called maes_prac.

[ ]:
Now that we have a working ARMA model, let's turn the output into a DataFrame so we can see what happened.

[ ]:
mae_grid = pd.DataFrame(maes)
mae_grid.round(4)
And let's visualize the DataFrame in a heatmap. (If you need a refresher on how to create a heatmap in seaborn, refer to Notebook 008.)

[ ]:
sns.heatmap(mae_grid, cmap="Blues")
plt.xlabel("p values")
plt.ylabel("q values")
plt.title("Grid Search (Criterion: MAE)");
<font size="+1">Practice</font>

Try it yourself! Read the output of your ARMA model into a DataFrame called mae_grid_prac, and visualize it in a heatmap.

[ ]:
mae_grid_prac = ...
​
​
It looks like our MAE values are in the right place, but let's try some other ways to explore our new model using the `plot_diagnostics` method.

[ ]:
fig, ax = plt.subplots(figsize=(15, 12))
model.plot_diagnostics(fig=fig);
As usual, we have quite a lot to sift through here. The first graph shows our model's residuals. Ideally, we'd like to see these stay as close to zero as possible, and this graph tells us that, for the most part, we have a good model.

The next graph over shows us another version of the same thing. The histogram is similar to the one we made before, but there's a pair of lines superimposed. These lines show the kernel density estimate, which is a smoothed-out version of the blue histogram bars. The green line represents a normal distribution, and is included here just to give you something to compare your model's values to. The orange line is the smoothed-out version of the results of your model. Our model is actually pretty close to a normal distribution, so that's good!

The Q-Q plot on the bottom left is yet another way to visualize the same thing. Here, the red line is showing us a perfect 1:1 correlation between our variables, and the wavy blue line is showing us what we actually have. Again, our model is pretty close to the red line, so it's looking good.

And finally, we have a correlogram, which might look familiar; it's the same kind of plot as the ACF and PACF plots from our AR models.

Practice

Try it yourself! Use plot_diagnostics to examine the residuals from model_prac.

[ ]:
## Communicating the Results

Now that we have an ARMA model that seems to be working well, it's time to communicate the results of our analysis in a line graph. Let's create a graph that shows the relationship between our training and predicting data. (For a refresher on how to do this and what it means, refer to the AR notebook.)

[ ]:
y_train_pred = model.predict()
df_predictions = pd.DataFrame(
    {"y_train": y_train, "y_pred": y_train_pred}, index=y_train.index
)
fig = px.line(df_predictions, labels={"value": "P2"})
fig.show()
<font size="+1">Practice</font>

Try it yourself! Create a line plot that compares the y_prac_train and y_prac_pred values.

[ ]:
# ARCH and GARCH Models

**ARCH** stands for autoregressive conditionally heteroscedastic, and it's a model for the variance of a time series. An **ARCH** model assumes the variance at time $t$ depends on the **past squared observations**. A **GARCH** (generalized autoregressive conditionally heteroscedastic) model uses both the **past squared observations** and the **past variances** to model the variance at time $t$. **ARCH** and **GARCH** models are widely used in financial and econometric time-series analysis. For more details, check out these YouTube videos.

[ ]:
from IPython.display import YouTubeVideo

YouTubeVideo("Li95a2biFCU")
[ ]:
YouTubeVideo("inoBpq1UEn4")
Let's see an example of using a **GARCH** model to forecast Apple stock price volatility. We first import the data.

[ ]:
df = pd.read_csv("./data/AAPL.csv", parse_dates=["date"], index_col="date")
​
print("df shape:", df.shape)
df.head()
The dataset ranges from 1999 to 2022. For simplicity, we only take the data from October 2015 onward.

[ ]:
df = df["2015-10-13":]
df.shape
## Calculating Returns

The next step is to calculate the returns of the stock's closing prices. Here we can use the `pct_change` method from pandas to calculate the daily percentage change, then multiply the result by 100 to express the returns as percentages:

[ ]:
df.sort_index(ascending=True, inplace=True)
df["return"] = df["close"].pct_change() * 100
df.head()
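Under the hood, `pct_change` computes `(today - yesterday) / yesterday` for each row; here's a quick sketch with toy closing prices:

```python
# Toy closing prices (hypothetical values)
closes = [100.0, 110.0, 99.0]

# Daily percentage returns, multiplied by 100 like in the cell above
returns = [
    (closes[i] - closes[i - 1]) / closes[i - 1] * 100
    for i in range(1, len(closes))
]
print(returns)  # [10.0, -10.0]
```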
We can then take out the return column as our training data. Note the first observation is missing since there is no past value to observe. We need to drop it:

[ ]:
AAPL_return = df["return"].dropna()
AAPL_return.head()
We can check the histogram to see the distribution of the returns over the past five years:

[ ]:
plt.hist(AAPL_return, bins=50)
​
plt.xlabel("Returns")
plt.ylabel("Frequency [count]")
​
# Add title
plt.title("Distribution of AAPL Daily Returns");
There's a negative outlier in this date range. Using the `idxmin` method, we find that it occurred on 31 August 2020, when the price series dropped sharply because of a [stock split](https://www.cnbc.com/2020/08/31/history-of-apple-stock-splits-says-dont-rush-in-to-buy-cheaper-shares.html).

[ ]:
AAPL_return.idxmin(), AAPL_return.min()
We can also check the standard deviation of the whole dataset with the pandas `std()` function:

[ ]:
AAPL_return.std()
To see these statistics more clearly, we can use some time-series plots. In addition to plotting the Apple stock returns, we can plot the rolling volatility of the returns to smooth out the noise and see how return and volatility are associated with each other. First, we calculate a 50-day rolling standard deviation series for the Apple stock returns:

[ ]:
AAPL_return_rolling_50d_volatility = AAPL_return.rolling(window=50).std().dropna()
AAPL_return_rolling_50d_volatility.head()
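A rolling standard deviation just recomputes the sample standard deviation over a sliding window. Here's a minimal pure-Python sketch with a 3-observation window and toy numbers:

```python
from statistics import stdev

# Toy daily returns (hypothetical values)
returns = [1.0, -1.0, 2.0, -2.0, 3.0]
window = 3

# One sample standard deviation per full window, like `rolling(window).std()`
rolling_std = [
    stdev(returns[i - window + 1 : i + 1])
    for i in range(window - 1, len(returns))
]
print([round(v, 3) for v in rolling_std])
```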
We can plot these two series:

[ ]:
fig, ax = plt.subplots(figsize=(15, 6))
​
AAPL_return.plot(ax=ax, label="daily return")
​
# Plot
AAPL_return_rolling_50d_volatility.plot(
    ax=ax, label="50d rolling volatility", linewidth=3
)
​
# Add axis labels
plt.xlabel("Date")
plt.ylabel("Return")
​
​
plt.legend();
Here we can see that volatility goes up when the returns change drastically, up or down. For instance, when the return dropped enormously in August 2020, volatility also increased a lot. However, this plot reveals a problem. We want to see whether high volatility on one day is associated with high volatility on the following day. But high volatility is caused by large changes in returns, which can be either positive or negative. How can we assess negative and positive numbers together without them canceling each other out? The common solution used in building ARCH or GARCH models is to square the returns. Let's plot the squared returns:

[ ]:
fig, ax = plt.subplots(figsize=(15, 6))
​
# Plot squared returns
(AAPL_return**2).plot(ax=ax, label="daily return")
​
# Add axis labels
plt.xlabel("date")
plt.ylabel("Squared Returns");
Now it's much easier to see clusters of high and low volatility, and we're ready to build a GARCH model. To build one, we need to figure out both the `p` and `q` parameters. The `p` parameter handles correlations at prior time steps, and the `q` parameter deals with prior variances, like shocks. Both use the notion of lag. To see how many lags we should include in our model, we create ACF and PACF plots, this time using the squared returns.

[ ]:
fig, ax = plt.subplots(figsize=(15, 6))
​
# Create ACF of squared returns
plot_acf(AAPL_return**2, ax=ax)
​
# Add axis labels
plt.xlabel("Lag [days]")
plt.ylabel("Correlation Coefficient");
[ ]:
fig, ax = plt.subplots(figsize=(15, 6))
​
# Create PACF of squared returns
plot_pacf(AAPL_return**2, ax=ax)
​
# Add axis labels
plt.xlabel("Lag [days]")
plt.ylabel("Correlation Coefficient");
Both the ACF and PACF graphs show that one lag is enough to build the model.

## Building the Model

[ ]:
# Build and train model (assuming the `arch` package is installed)
from arch import arch_model

model = arch_model(AAPL_return, p=1, q=1, rescale=False).fit(disp=0)

print("model type:", type(model))

# Show model summary
model.summary()
## Common Metrics

The model summary provides a lot of information about the model, like the trained parameters, model performance, etc. **AIC** and **BIC** are important measures of model performance. **AIC** stands for Akaike Information Criterion; it measures the goodness of fit of an estimated statistical model. **BIC** stands for Bayesian Information Criterion, which is used to select a model from a finite set of candidates. When fitting models, it's possible to improve the fit by adding more parameters, but doing so can also result in overfitting. Both criteria address this problem by introducing a penalty term for the number of parameters in the model; the penalty is larger in BIC than in AIC. For both measures, the lower the value, the better the model.
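For reference, the standard definitions, where $k$ is the number of estimated parameters, $n$ is the number of observations, and $\hat{L}$ is the maximized likelihood, are:

$$\text{AIC} = 2k - 2\ln(\hat{L}) \qquad \text{BIC} = k\ln(n) - 2\ln(\hat{L})$$

Both penalize extra parameters through the $k$ term, and BIC's penalty $k\ln(n)$ exceeds AIC's $2k$ whenever $n > e^2 \approx 7.4$.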

## Standardized Residuals

After fitting the model, we can also get the **residuals** to see how it performs. The residual at time $t$ is the observed return at time $t$ minus the model's estimated mean return at time $t$. For models that model observations directly, like stock prices, the assumption is that the residuals are noise that follows a normal distribution. In a GARCH model, since we are modeling the variance of the returns, the raw residuals will not follow a normal distribution. So we need to use **standardized residuals** instead to check normality. The standardized residual at time $t$ is the residual at time $t$ divided by the square root of the model's estimated variance at time $t$. Let's check the plot of the standardized residuals across the time series:

[ ]:
fig, ax = plt.subplots(figsize=(15, 6))
​
# Plot standardized residuals
model.std_resid.plot(ax=ax, label="Standardized Residuals")
​
# Add axis labels
plt.xlabel("Date")
plt.ylabel("Value")
​
# Add legend
plt.legend();
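The standardization itself is simple division; here's a toy sketch with made-up numbers, not our model's actual output:

```python
# Hypothetical residual and model-estimated variance at one time step
residual = 3.0
estimated_variance = 4.0

# Standardized residual = residual / sqrt(estimated variance)
standardized = residual / estimated_variance**0.5
print(standardized)  # 1.5
```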
The standardized residuals move around 0, except for the outlier event in August 2020. Let's check normality with a histogram:

[ ]:
# Create histogram of standardized residuals
plt.hist(model.std_resid, bins=50)
​
# Add axis labels
plt.xlabel("Standardized Residual")
plt.ylabel("Frequency [count]")
​
# Add title
plt.title("Distribution of Standardized Residuals");
If we exclude the outlier, the other standardized residuals do follow a normal distribution with mean 0.

## Evaluation

We can further evaluate the model by comparing its forecast with a subset of the observed returns to see whether the model has successfully captured the volatility. We can first check the model conditional volatility in its confidence interval throughout the dataset:

[ ]:
fig, ax = plt.subplots(figsize=(15, 6))
​
# Plot `AAPL_return`
AAPL_return.plot(ax=ax)
​
# Plot conditional volatility * 2
(2 * model.conditional_volatility).plot(
    ax=ax, color="C1", label="2 SD Conditional Volatility"
)
​
​
# Plot conditional volatility * -2
(-2 * model.conditional_volatility.rename("")).plot(ax=ax, color="C1")
​
​
# Add axis labels
plt.xlabel("Date")
plt.ylabel("Daily Returns")
​
# Add legend
plt.legend();
We can then select the test set of the data and do a walk-forward validation on the GARCH model:

[ ]:
# Create empty list to hold predictions
predictions = []
​
# Calculate size of test data (20%)
test_size = int(len(AAPL_return) * 0.2)
​
# Walk forward
for i in range(test_size):
    # Create test data
    y_train = AAPL_return.iloc[: -(test_size - i)]
​
    # Train model
    model = arch_model(y_train, p=1, q=1, rescale=False).fit(disp=0)
​
    # Generate next prediction (volatility, not variance)
​
    next_pred = model.forecast(horizon=1, reindex=False).variance.values[0][0] ** 0.5
​
    # Append prediction to list
    predictions.append(next_pred)
​
# Create Series from predictions list
y_test_wfv = pd.Series(predictions, index=AAPL_return.tail(test_size).index)
​
print("y_test_wfv type:", type(y_test_wfv))
print("y_test_wfv shape:", y_test_wfv.shape)
y_test_wfv.head()
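The walk-forward loop above grows the training window one observation at a time. Here's a sketch of the splits it produces on a toy series of 10 observations with a 20% test set:

```python
observations = list(range(10))            # toy data: 0..9
test_size = int(len(observations) * 0.2)  # 2

# For each test step, record the training-set size and the next target
splits = []
for i in range(test_size):
    train = observations[: -(test_size - i)]
    next_target = observations[len(train)]
    splits.append((len(train), next_target))

print(splits)  # [(8, 8), (9, 9)]
```

Each model is retrained on everything up to (but not including) the observation it is asked to predict.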
We can plot the predicted volatility versus the returns in this test set, and see that the model has captured most of the variance:

[ ]:
fig, ax = plt.subplots(figsize=(15, 6))
​
# Plot returns for test data
AAPL_return.tail(test_size).plot(ax=ax, label="AAPL return")
​
# Plot volatility predictions * 2
(2 * y_test_wfv).plot(ax=ax, c="C1", label="2 SD Predicted Volatility")
​
# Plot volatility predictions * -2
(-2 * y_test_wfv).plot(ax=ax, c="C1")
​
# Label axes
plt.xlabel("Date")
plt.ylabel("Return")
​
# Add legend
plt.legend();
## Forecasting

The last step is to make forecasts with our trained model. Let's make a one-day forecast first to see how GARCH forecasting works:

[ ]:
one_day_forecast = model.forecast(horizon=1, reindex=False).variance
​
print("one_day_forecast type:", type(one_day_forecast))
one_day_forecast
There are two things we need to keep in mind here. First, our model forecasts the predicted variance, not the standard deviation (volatility), so we'll need to take the square root of the value. Second, the prediction's index is in the format `"h.1"`, meaning one step ahead of the last date in the training data. It's not the DatetimeIndex that we want, so we'll need to reformat the forecasts to be more readable.

First, we generate a 5-day prediction with our trained GARCH model:

[ ]:
# Generate 5-day variance forecast
# (we'll take the square root in the next cell, so we don't apply it twice)
prediction = model.forecast(horizon=5, reindex=False).variance

# Calculate forecast start date
start = prediction.index[0] + pd.DateOffset(days=1)

# Create date range
prediction_dates = pd.bdate_range(start=start, periods=prediction.shape[1])

# Create prediction index labels, ISO 8601 format
prediction_index = [d.isoformat() for d in prediction_dates]

print("prediction_index type:", type(prediction_index))
print("prediction_index len:", len(prediction_index))
prediction_index[:3]
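Both formatting steps are small. Taking the square root converts a variance forecast to volatility, and ISO 8601 date strings can be produced with Python's standard library (the variance value and date here are made up for illustration):

```python
from datetime import date

# Variance -> volatility (standard deviation)
variance_forecast = 4.0
volatility_forecast = variance_forecast**0.5
print(volatility_forecast)  # 2.0

# ISO 8601 formatting for a hypothetical forecast date
print(date(2020, 10, 14).isoformat())  # 2020-10-14
```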
Then we clean up the predictions to make them more readable:

[ ]:
# Take the square root of the model forecasts to get volatility
data = prediction.values.flatten() ** 0.5

# Format the forecasts with the correct dates
prediction_formatted = pd.Series(data, index=prediction_index)

# Show the results
prediction_formatted.to_dict()
# References & Further Reading

  • More on ARMA models
  • Even more on ARMA models
  • Information on p and q parameters
  • A primer on autoregression
  • More information on ACF plots
  • A primer on ACF and PACF
  • Background on residuals
  • More on walk-forward validation
  • Reading on parameters and hyperparameters
---

Copyright 2022 WorldQuant University. This content is licensed solely for personal use. Redistribution or publication of this material is strictly prohibited.


Usage Guidelines

This lesson is part of the DS Lab core curriculum. For that reason, this notebook can only be used on your WQU virtual machine.

This means:

  • ⓧ No downloading this notebook.
  • ⓧ No re-sharing of this notebook with friends or colleagues.
  • ⓧ No downloading the embedded videos in this notebook.
  • ⓧ No re-sharing embedded videos with friends or colleagues.
  • ⓧ No adding this notebook to public or private repositories.
  • ⓧ No uploading this notebook (or screenshots of it) to other websites, including websites for study resources.

<font size="+3"><strong>The Linux Command Line</strong></font>

Linux is an open-source operating system. On computers running Linux, you pass commands to the computer through a text interface, and it returns what you request. The text interface could be a shell or a terminal.

xxxxxxxxxx
# Navigating Files and Directories

We can run shell commands directly in a Jupyter notebook by adding the cell magic `%%bash`:
First, we can look at the current directory we are in:
[1]:
 
%%bash
pwd
/home/jovyan/work/ds-curriculum/@textbook
From the current directory, we can list the files it contains using `ls`:

[2]:
 
%%bash
ls
01-python-getting-started.2022-12-23T07-21-48-609Z.ipynb
01-python-getting-started.ipynb
02-python-advanced.ipynb
03-pandas-getting-started.2022-12-23T07-21-48-609Z.ipynb
03-pandas-getting-started.ipynb
04-pandas-advanced.2022-12-23T07-21-48-609Z.ipynb
04-pandas-advanced.ipynb
05-pandas-summary-statistics.2022-12-23T07-21-48-609Z.ipynb
05-pandas-summary-statistics.ipynb
06-visualization-matplotlib.2022-12-23T07-21-48-609Z.ipynb
06-visualization-matplotlib.ipynb
07-visualization-pandas.ipynb
08-visualization-plotly.ipynb
09-visualization-seaborn.ipynb
10-databases-sql.ipynb
11-databases-mongodb.ipynb
12-ml-core.ipynb
13-ml-data-pre-processing-and-production.ipynb
14-ml-classification.ipynb
15-ml-regression.ipynb
16-ml-unsupervised-learning.ipynb
17-ts-core.ipynb
18-ts-models.ipynb
19-linux-command-line.ipynb
20-statistics.ipynb
21-python-object-oriented-programming.ipynb
22-apis.ipynb
data
main.py
We can go further into the `data` directory using `cd` and the directory name:

[3]:
 
%%bash
cd data
ls
AAPL.csv
clothes.pkl
colombia-real-estate-1.csv
colombia-real-estate-2.csv
colombia-real-estate-3.csv
colombia-real-estate-chi-square.csv
config.py
mexico-city-real-estate-1.csv
mexico-city-real-estate-2.csv
mexico-city-real-estate-3.csv
mexico-city-real-estate-4.csv
mexico-city-real-estate-5.csv
mexico-city-test-features.csv
mexico-city-test-labels.csv
numlist.pkl
poland-bankruptcy-data-2007.json.gz
poland-bankruptcy-data-2008.json.gz
SCFP2019-textbook.csv.gz
Note that each `%%bash` cell runs in its own shell, so the `cd data` above does not persist; the next cell starts back in the `@textbook` directory.

If we want to go back to the previous directory, we use `cd ..`:

[4]:
 
%%bash
cd ..
pwd
/home/jovyan/work/ds-curriculum
Or we can use `cd` with no arguments to go directly back to the home directory:

[5]:
 
%%bash
cd
pwd
/home/jovyan
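The same navigation can also be done from Python itself with the standard `pathlib` module, which is handy when a `%%bash` cell isn't convenient. A minimal sketch:

```python
from pathlib import Path

# Current working directory (the Python analogue of `pwd`)
cwd = Path.cwd()
print(cwd)

# List the directory's contents (the analogue of `ls`), sorted by name
for entry in sorted(cwd.iterdir()):
    print(entry.name)

# Go "up" one level (the analogue of `cd ..`): just take the parent
print(cwd.parent)
```

Because these are ordinary Python objects, the results can be fed straight into other Python code rather than parsed from shell output.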
<font size="+1">Practice</font>

Navigate to the `ds-curriculum/010-housing-in-mexico` directory, and list all the files there.

[6]:
 
%%bash
# navigate to the directory and list its files here
# Viewing File Contents
### Navigate into the data directory

[ ]:
 
%%bash
pwd
[ ]:
 
%%bash
cd data
pwd
### Create a `.txt` file using a context manager

[ ]:
 
with open("data/example.txt", "w") as f:
    f.write("Hello")
#### View the head of this file

[ ]:
 
%%bash
cd data
head example.txt
#### View head and tail for a file with multiple lines

[ ]:
 
with open("data/example.txt", "w") as f:
    f.write("Hello")
    f.write("\n")
    f.write("Hola")
[ ]:
 
%%bash
cd data
head example.txt
#### View the tail of this file

[ ]:
 
%%bash
cd data
tail -1 example.txt
# Compressing and Decompressing Files

## Data Compression

**Data compression** is the process of encoding information using fewer bits than the original representation, through the use of specific encoding schemes. Compression matters because it reduces the consumption of expensive resources such as disk space or transmission bandwidth.

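We can see this effect from Python with the standard `zlib` module (the same DEFLATE algorithm `gzip` uses): highly repetitive text shrinks dramatically, while random bytes, a stand-in here for already-compressed data, barely shrink at all. A minimal sketch:

```python
import os
import zlib

text = b"<p>hello world</p>" * 1000   # repetitive, HTML-like text
random_bytes = os.urandom(len(text))  # incompressible, like an already-compressed image

compressed_text = zlib.compress(text)
compressed_random = zlib.compress(random_bytes)

print("text:  ", len(text), "->", len(compressed_text), "bytes")
print("random:", len(random_bytes), "->", len(compressed_random), "bytes")
```

The repetitive input compresses to a tiny fraction of its size; the random input stays essentially the same size (plus a little overhead).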
### `gzip`
`gzip` is a common data compression tool. The original data can be restored by unzipping the compressed file. `gzip` can compress any file, regardless of file type or encoding. Some files compress more effectively than others, so the space savings vary: text files like HTML give the best results, while images gain little because they typically already have compression built in.
Here is the code to compress and decompress files using `gzip`:
First we go back to the home directory, then navigate to the directory where the file is.

[ ]:
 
%%bash
cd
pwd
[ ]:
 
%%bash
cd data
gzip example.txt
ls
Now we can see the `example.txt.gz` file in the directory.

We can unzip the file with the code below:

[ ]:
 
%%bash
cd data
gzip -d example.txt.gz
ls
Now you see the unzipped `example.txt` file.

We can also customize the name of the zipped file using the code below:

[ ]:
 
%%bash
cd data
gzip -c example.txt > output.txt.gz
ls
`example.txt` and `output.txt.gz` are both in the directory.

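Besides the command-line tool, Python's built-in `gzip` module can read and write `.gz` files programmatically. A minimal sketch (written to a temporary directory here; in the notebook you would use a path like `data/example.txt.gz`):

```python
import gzip
import tempfile
from pathlib import Path

path = Path(tempfile.mkdtemp()) / "example.txt.gz"

# Write a compressed file directly (like `gzip -c example.txt > example.txt.gz`)
with gzip.open(path, "wt") as f:
    f.write("Hello\nHola\n")

# Read it back, decompressing transparently (like `gzip -d`)
with gzip.open(path, "rt") as f:
    print(f.read())
```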
### Practice: Create a text file in the data directory and zip it

Create a text file in the `data` directory called `text.txt`, and fill it with any text you want.

[ ]:
 
# create a text file called text.txt and write any text in it
Next, compress `text.txt` using `gzip`.

[ ]:
 
%%bash
# compress text.txt with gzip here
# References and Further Reading
  • [Software Carpentries, *The Unix Shell*](https://swcarpentry.github.io/shell-novice/)
  • Data Compression
  • Stack Overflow: What is gzip
<font size="+3"><strong>Statistics</strong></font>

[1]:
 
import matplotlib.pyplot as plt
import numpy as np
from IPython.display import YouTubeVideo
# Distributions

## Normal Distribution

The normal distribution is the most widely used continuous distribution. It's a bell-shaped distribution with the following unique characteristics:

  1. Normal distributions are defined by only two parameters, the mean ($\mu$) and the standard deviation ($\sigma$).
  2. The mean, median, and mode are all equal.
  3. The distribution is symmetric about the mean.
  4. 68% of the area of a normal distribution is within one standard deviation of the mean, 95% of the area is within two standard deviations of the mean, and 99.7% is within three standard deviations of the mean.
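The 68-95-99.7 rule in the last point can be checked numerically with SciPy, by taking the area under the standard normal density over each interval:

```python
from scipy.stats import norm

# Area within k standard deviations of the mean for the standard normal
for k in [1, 2, 3]:
    area = norm.cdf(k) - norm.cdf(-k)
    print(f"within {k} standard deviation(s): {area:.4f}")
# -> 0.6827, 0.9545, 0.9973
```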
The following graph shows an example of a normal distribution. You can see how it's bell shaped and symmetric about the mean.
![normal distribution](../images/normal-dist-pdf.png)

[2]:
 
from scipy.stats import norm

fig, ax = plt.subplots()
# Define a standard normal distribution (mean 0, standard deviation 1)
x = np.arange(-4, 4, 0.001)
ax.plot(x, norm.pdf(x, loc=0, scale=1))
ax.set_ylim(0, 0.45)

ax.set_xlabel("x")
ax.set_ylabel("pdf(x)")
ax.grid(True)

# fill within one standard deviation of the mean with red
px = np.arange(-1, 1, 0.01)
ax.fill_between(px, norm.pdf(px, loc=0, scale=1), alpha=0.5, color="r")

# fill between one and two standard deviations of the mean with blue
px = np.arange(-2, -1, 0.01)
ax.fill_between(px, norm.pdf(px, loc=0, scale=1), alpha=0.5, color="b")

px = np.arange(1, 2, 0.01)
ax.fill_between(px, norm.pdf(px, loc=0, scale=1), alpha=0.5, color="b")

# fill beyond two standard deviations of the mean with green
px = np.arange(-4, -2, 0.01)
ax.fill_between(px, norm.pdf(px, loc=0, scale=1), alpha=0.5, color="g")

px = np.arange(2, 4, 0.01)
ax.fill_between(px, norm.pdf(px, loc=0, scale=1), alpha=0.5, color="g")

# 68% of the area is within one standard deviation of the mean
ax.text(-0.5, 0.2, "68%", fontsize=20)
# about 5% of the area lies beyond two standard deviations of the mean
ax.text(-2.5, 0.01, "5%", fontsize=10)

plt.show()
# Probability Densities

If you were an engineer designing the interior of an airplane, you'd need to balance precision and uncertainty. On the one hand, you'd need to provide exact specifications and measurements for the airplane seats, but those seats need to accommodate passengers whose height and weight you don't know. After all, there is a lot of variation from one passenger to the next! So how can you mathematically represent the measurements for height and weight in such a way that you can incorporate them into the engineering formulas needed to create your design specifications? The answer is a probability density function (PDF).

By definition, a probability density function shows the chance of observing an instance of a random variable $X$ that equals a particular value $x$:

$$f(x) = P(X = x)$$

Going back to our airplane seat example, $X$ could represent the height of an adult male passenger. It turns out that, if you measure thousands of American men, you'll end up with a mean height of 177 cm and a standard deviation of 7 cm. With that information, you can create a PDF that looks like a normal distribution.

Using the SciPy library, we can use a probability density function to plot the distribution of heights for American men.

[3]:
 
import numpy as np
from scipy.stats import norm

# define plot range for x axis
x = np.linspace(149, 205, 1000)

# PDF for a normal distribution with mean 177 and standard deviation 7
y = norm.pdf(x, loc=177, scale=7)

plt.plot(x, y)
plt.title("Distribution of Heights for American Men")
plt.xlabel("Height [cm]")
plt.ylabel("Probability density");
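With this distribution in hand, the engineer can turn design questions into probabilities. For instance (190 cm is an arbitrary illustrative threshold), the fraction of men taller than 190 cm comes from the survival function, `sf(x) = 1 - cdf(x)`:

```python
from scipy.stats import norm

# P(height > 190 cm) for heights ~ Normal(mean=177, sd=7)
p_tall = norm.sf(190, loc=177, scale=7)
print(f"P(height > 190 cm) = {p_tall:.3f}")  # roughly 0.03
```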
# Cumulative Distribution Functions

The cumulative distribution function (CDF) gives the probability that the random variable $X$ takes on a value less than or equal to $x$. As a formula:

$$F_X(x) = P(X \leq x)$$

We can calculate the CDF of a normal distribution using SciPy. First we define a normal distribution from the SciPy stats library. By default, the normal distribution has mean equal to 0 and standard deviation equal to 1.

[4]:
 
from scipy.stats import norm

# get mean and standard deviation
mean = norm.mean()
std = norm.std()

print(f"The mean of this Normal Distribution is: {mean}")
print(f"The standard deviation of this Normal Distribution is: {std}")
The mean of this Normal Distribution is: 0.0
The standard deviation of this Normal Distribution is: 1.0
By definition, the area under the PDF is 1, and since the normal distribution is symmetric around the mean, the **CDF** at 0 should be 0.5.

[5]:
 
# Get `cdf` at 0
cdf_value = norm.cdf(0)

print(f"CDF at 0 is: {cdf_value}")
CDF at 0 is: 0.5
We can also pass a list of values to check the **CDF**:

[6]:
 
cdf_value = norm.cdf([-std, std])
cdf_value
[6]:
array([0.15865525, 0.84134475])
Note that the difference between `norm.cdf(1)` and `norm.cdf(-1)` is the probability of $x$ being within one standard deviation of the mean, on either side. According to the properties of the normal distribution, this probability should be around 68%:

[7]:
 
norm.cdf(1) - norm.cdf([-1])
[7]:
array([0.68268949])
<font size="+1">Practice</font>

Calculate CDF for -2 and 2.

[8]:
 
cdf_value = ...
cdf_value
[8]:
Ellipsis
# Central Limit Theorem

The **Central Limit Theorem (CLT)** states that no matter what the population's original distribution is, when taking random samples from the population, the distribution of the means or sums of the random samples approaches a normal distribution, whose mean equals the population mean, as the sample size gets larger.

## CLT for Random Samples

We can simulate the Central Limit Theorem by taking sub-samples from a randomly sampled population, computing the mean of each sub-sample, and plotting the means to see whether they follow a normal distribution. Here is the code for the simulation:

[9]:
import matplotlib.pyplot as plt
import numpy as np

# generate a random sample of size 10,000 from a uniform distribution
lis = np.random.random(size=10000)
plt.hist(lis);
We can clearly see the distribution is not a normal distribution. Now let's take sub-samples from the random sample, then take the mean of each sub-sample.

[10]:
means = []
# take 1,000 random sub-samples, each of size 1,000
for i in range(1000):
    # draw a sub-sample by picking random index numbers from the sample
    lis_index = np.random.randint(0, 10000, 1000)

    # collect the mean of each sub-sample
    means.append(lis[lis_index].mean())

# plot the distribution of the means
plt.hist(means);
We can see the distribution above has a bell shape, like the normal distribution.

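The loop above can also be written as a single vectorized computation; a sketch using NumPy's `Generator` API, drawing all 1,000 sub-samples at once and taking row-wise means:

```python
import numpy as np

rng = np.random.default_rng(42)

population = rng.random(10_000)  # uniform "population", as above

# 1,000 sub-samples of size 1,000, drawn with replacement, as one 2-D array
subsamples = rng.choice(population, size=(1000, 1000), replace=True)

# One mean per row, i.e., per sub-sample
means = subsamples.mean(axis=1)

# The CLT predicts these cluster near the population mean (~0.5) with
# standard error sigma / sqrt(n) = (1 / sqrt(12)) / sqrt(1000), about 0.009
print(means.mean(), means.std())
```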
## CLT for Sums

Simply changing the mean to the sum, we can see the Central Limit Theorem also holds for sums:

[11]:
 
lis = np.random.random(size=10000)
plt.hist(lis);
The above distribution is from a random sample, and is certainly not a normal distribution. Now we take the sum of each sub-sample, just as we took means in the previous example:

[12]:
sums = []
# take 1,000 random sub-samples, each of size 1,000
for i in range(1000):
    # draw a random sub-sample
    lis_index = np.random.randint(0, 10000, 1000)

    # collect the sum of each sub-sample
    sums.append(lis[lis_index].sum())

plt.hist(sums);
Now we can see the distribution of sums is much like a bell-shaped normal distribution.

Check out this YouTube video from Khan Academy for a more detailed explanation of sampling.

[13]:
 
YouTubeVideo("z0Ry_3_qhDw")
[13]:
Check out this YouTube video for an example of how to use the Central Limit Theorem for sums.

[14]:
 
YouTubeVideo("wY3TSCG-G3Q")
[14]:
# Hypothesis Testing

**Hypothesis testing** is a method of statistical inference. First, we define a **null hypothesis** ($H_0$), which usually proposes that no statistical relationship or significance exists. Next, we have an **alternative hypothesis** ($H_a$, also written $H_1$), which proposes that there is a statistical relationship. For example, if we want to test whether there is a significant difference in salaries between employees with and without advanced degrees:

  • The null hypothesis is that there is no significant difference in salaries between employees with and without advanced degrees;
  • The alternative hypothesis is that there is a significant difference in salaries between employees with and without advanced degrees.

Hypothesis testing is a two-step process that tests whether the null hypothesis is true based on data collected from a survey or an experiment. First, we calculate the probability of observing an effect at least as large as the one in the data, assuming the null hypothesis is true; this probability is the **p-value**. Second, we compare the p-value with the significance level we've chosen (usually 0.05). If the p-value is smaller than the significance level, we reject the null hypothesis because the effect we're observing is probably real. If the p-value is larger, we fail to reject the null hypothesis.

Either way, we can never be 100% certain our observed results are real, because we're dealing with probability. Even if we're almost certainly correct, the closest we can get is almost, because there's always a possibility, no matter how small, that our results are wrong. Maybe we found something that wasn't actually there, or we didn't find something that actually was. In statistics, we call those two mistakes Type I and Type II errors. They work like this:

  • In a Type I error, the null hypothesis is actually true, but we reject it. This is called a False Positive, and its probability is $\alpha$;
  • In a Type II error, the null hypothesis is actually false, but we fail to reject it. This is called a False Negative, and its probability is $\beta$.

|                | $H_0$ is rejected        | $H_0$ is not rejected   |
|----------------|--------------------------|-------------------------|
| $H_0$ is False | $1 - \beta$              | False Negative: $\beta$ |
| $H_0$ is True  | False Positive: $\alpha$ | $1 - \alpha$            |

For example, a False Positive occurs when salaries between employees with and without advanced degrees are not significantly different ($H_0$ true), but we conclude that they are significantly different (rejecting $H_0$). A False Negative occurs when the salaries are significantly different ($H_0$ false), but we find that they are not (failing to reject $H_0$).

## Power

In the table above, when we reject $H_0$ and $H_0$ is false, the probability is $1 - \beta$, which is called the statistical power of the test. It represents the probability of correctly rejecting the null hypothesis. You want the statistical power of a test to be as large as possible.

# Effect Size

**Effect size** measures the strength of a relationship between two variables, or between the same variable in different samples. For example, when we compare an experiment's results between the treatment group and the control group, we measure the difference between the two groups to see the effect size of the experiment. One common measurement of effect size in this case is **Cohen's d**, which is the difference between the treatment and control group means, divided by the standard deviation of the data:

$$d = \frac{\bar{x}_1 - \bar{x}_2}{s}$$

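As a small illustration with made-up numbers, Cohen's d can be computed directly from two samples; here $s$ is taken to be the pooled standard deviation, one common choice:

```python
import numpy as np

def cohens_d(treatment, control):
    """Difference in means divided by the pooled standard deviation."""
    treatment, control = np.asarray(treatment, float), np.asarray(control, float)
    n1, n2 = len(treatment), len(control)
    # Pooled variance: weighted average of the two sample variances
    pooled_var = (
        (n1 - 1) * treatment.var(ddof=1) + (n2 - 1) * control.var(ddof=1)
    ) / (n1 + n2 - 2)
    return (treatment.mean() - control.mean()) / np.sqrt(pooled_var)

print(cohens_d([5, 6, 7, 8], [3, 4, 5, 6]))  # about 1.55, a large effect
```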
# Chi-Square

Among hypothesis tests based on statistical models, the chi-square test of independence is widely used to test whether there is a significant relationship between two nominal variables. The null hypothesis $H_0$ and the alternative hypothesis $H_a$ for testing the independence of variable X and variable Y are framed as follows:

  • Null Hypothesis ($H_0$): There is no relationship between X and Y, i.e., X and Y are independent of each other;

  • Alternative Hypothesis ($H_a$): There is a relationship between X and Y.

In the following section, we will use the Colombia real estate dataset to illustrate the process of conducting a chi-square test of independence. First, we read the dataset:

[15]:
import pandas as pd

df = pd.read_csv("data/colombia-real-estate-chi-square.csv")
df.head()
[15]:
|   | property_type | department      | lat    | lon     | area_m2 | price_usd   | advertising | sold_in_2_wks |
|---|---------------|-----------------|--------|---------|---------|-------------|-------------|---------------|
| 0 | house         | Valle del Cauca | 3.404  | -76.506 | 114.0   | $82,724.99  | radio       | unsold        |
| 1 | house         | Bogotá D.C      | 4.641  | -74.058 | 70.0    | $162,073.45 | radio       | unsold        |
| 2 | house         | Cundinamarca    | 4.885  | -74.010 | 260.0   | $422,066.30 | radio       | sold          |
| 3 | house         | Bogotá D.C      | 4.747  | -74.072 | 122.0   | $185,709.17 | radio       | unsold        |
| 4 | house         | Bolívar         | 10.412 | -75.485 | 82.0    | $63,613.83  | radio       | sold          |
We will focus on two variables: `"advertising"` and `"sold_in_2_wks"`. There are two ways of advertising: through radio or through the internet. We want to test whether the method of advertising affects the probability of selling the property within two weeks. That is the same as testing whether `"advertising"` and `"sold_in_2_wks"` are independent of each other.

[16]:
 
df["advertising"].value_counts()
[16]:
radio       1533
internet    1533
Name: advertising, dtype: int64
[17]:
 
df["sold_in_2_wks"].value_counts()
[17]:
sold      1809
unsold    1257
Name: sold_in_2_wks, dtype: int64
We then use the chi-square test of independence and state the null hypothesis and alternative hypothesis:

  • Null Hypothesis ($H_0$): `"advertising"` and `"sold_in_2_wks"` are independent of each other.
  • Alternative Hypothesis ($H_a$): `"advertising"` and `"sold_in_2_wks"` are dependent on each other.
## Contingency Tables

In the chi-square test, we display the data in a cross-tabulation (contingency) format, showing the frequency count of each combination of the two categorical variables in rows and columns; this is called the contingency table. We can get the contingency table for `"advertising"` and `"sold_in_2_wks"` directly using pandas `crosstab`.

[18]:
 
tab = pd.crosstab(df["advertising"], df["sold_in_2_wks"])
tab
[18]:
| advertising | sold | unsold |
|-------------|------|--------|
| internet    | 1104 | 429    |
| radio       | 705  | 828    |
Alternatively, we can use `statsmodels` to construct the contingency table and conduct further analysis.

[19]:
 
from statsmodels.stats.contingency_tables import Table2x2

contingency_table = Table2x2(tab.values)
contingency_table
[19]:
<statsmodels.stats.contingency_tables.Table2x2 at 0x7f370e47ae20>
[20]:
 
# Elements in contingency table
contingency_table.table_orig
[20]:
array([[1104,  429],
       [ 705,  828]])
The fitted values are the estimated frequency counts for each cell under the assumption that `"advertising"` and `"sold_in_2_wks"` are independent of each other. Suppose the probabilities of `"internet"` and `"radio"` are $p$ and $1-p$, and the probabilities of `"sold"` and `"unsold"` are $q$ and $1-q$. Under independence, the probability for each cell is:

|               | internet: $p$ | radio: $1-p$ |
|---------------|---------------|--------------|
| sold: $q$     | $pq$          | $(1-p)q$     |
| unsold: $1-q$ | $p(1-q)$      | $(1-p)(1-q)$ |

We can check the joint distribution of the contingency table through `independence_probabilities`:

[21]:
 
contingency_table.independence_probabilities.round(2)
[21]:
array([[0.3, 0.2],
       [0.3, 0.2]])
Then we multiply each cell's probability by the total sample size $N$ to calculate the fitted value. That's where `fittedvalues` comes from.

[22]:
# Get fitted values
contingency_table.fittedvalues
[22]:
array([[904.5, 628.5],
       [904.5, 628.5]])
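The fitted values can be reproduced by hand: multiply each row total by each column total and divide by the grand total. A quick check with NumPy, using the counts from the contingency table above:

```python
import numpy as np

observed = np.array([[1104, 429],
                     [705, 828]])

# Expected count for each cell under independence: (row total * column total) / N
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected = row_totals * col_totals / observed.sum()
print(expected)  # [[904.5, 628.5], [904.5, 628.5]], matching `fittedvalues`
```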
Another useful quantity we can get from the probabilities of the contingency table is the **odds ratio**. The **odds** for an event with probability $p$ are:

$$\frac{p}{1-p}$$

The odds ratio is a ratio of odds between two groups, one with probability $p$ and one with probability $q$; the formula is:

$$\frac{p/(1-p)}{q/(1-q)}$$

From the table of probabilities above, we can calculate the odds ratio for the contingency table.

[23]:
# Get odds ratio
contingency_table.oddsratio
[23]:
3.0224073798541884
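We can sanity-check this value directly from the 2x2 counts: it is the odds of selling with internet advertising divided by the odds of selling with radio advertising.

```python
odds_internet = 1104 / 429  # sold : unsold for internet ads
odds_radio = 705 / 828      # sold : unsold for radio ads

print(odds_internet / odds_radio)  # about 3.022, matching `oddsratio`
```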
## Test for Independence

Before moving to the chi-square test statistic, we need to check whether we have enough observations to draw meaningful conclusions. For an effect size of `0.2`, a significance level of `0.05` ($\alpha$), and power of `0.8`, we can use the `GofChisquarePower` class to conduct a power analysis. First we check whether we have enough observations; then we can vary the effect size to see how sample size, power, and effect size are related.

[24]:
import math

from statsmodels.stats.power import GofChisquarePower

chi_square_power = GofChisquarePower()

group_size = math.ceil(
    chi_square_power.solve_power(effect_size=0.2, alpha=0.05, power=0.8)
)

print("Group size:", group_size)
print("Total # observations for two groups:", group_size * 2)
print("Total # observations in the data:", df.shape[0])
Group size: 197
Total # observations for two groups: 394
Total # observations in the data: 3066
In the following graph, let's see how effect size is related to sample size. If we assume power is 0.8, then to detect an effect size of 0.2 we need around 250 observations, while for an effect size of 0.5 we only need about 50, and the number drops further for an even bigger effect size of 0.8. This means it is much harder to detect a smaller effect size.

[25]:
import numpy as np

n_observations = np.arange(0, group_size * 2)
effect_sizes = np.array([0.2, 0.5, 0.8])

# Plot power curves using `chi_square_power`
chi_square_power.plot_power(
    dep_var="nobs", nobs=n_observations, effect_size=effect_sizes, alpha=0.05, n_bins=2
);
Lastly, we can conduct the test and calculate the p-value using `test_nominal_association()`:

[26]:
chi_square_test = contingency_table.test_nominal_association()

print("chi_square_test type:", type(chi_square_test))
print(chi_square_test)
chi_square_test type: <class 'statsmodels.stats.contingency_tables._Bunch'>
df          1
pvalue      0.0
statistic   214.65652643702725
We can see the *p-value* is around 0, which is much smaller than 0.05, the $\alpha$ value we set beforehand. We can then reject the null hypothesis, which states that `"advertising"` and `"sold_in_2_wks"` are independent of each other. This is the same as saying `"advertising"` and `"sold_in_2_wks"` are dependent on each other.

# References & Further Reading

  • Online Statistics Education: An Interactive Multimedia Course of Study
  • statsmodels documentation
  • Wikipedia: Normal Distribution
---

Copyright 2022 WorldQuant University. This content is licensed solely for personal use. Redistribution or publication of this material is strictly prohibited.


Usage Guidelines

This lesson is part of the DS Lab core curriculum. For that reason, this notebook can only be used on your WQU virtual machine.

This means:

  • ⓧ No downloading this notebook.
  • ⓧ No re-sharing of this notebook with friends or colleagues.
  • ⓧ No downloading the embedded videos in this notebook.
  • ⓧ No re-sharing embedded videos with friends or colleagues.
  • ⓧ No adding this notebook to public or private repositories.
  • ⓧ No uploading this notebook (or screenshots of it) to other websites, including websites for study resources.

<font size="+3"><strong>Python: Object-oriented Programming</strong></font>

# What's OOP?

OOP is short for Object-oriented Programming. It's a programming paradigm based on the concept of objects, which contain attributes and methods. In Python, things we are already familiar with, like integers, strings, and dictionaries, are all objects. Every object has a type, which defines what you can do with it. For example, the `int` type defines what happens when you add something to an integer and what happens when you try to convert it to a string.

[1]:
 
x = 42
print("%d is an object of %s" % (x, type(x)))
​
x = "Hello world!"
print("%s is an object of %s" % (x, type(x)))
42 is an object of <class 'int'>
Hello world! is an object of <class 'str'>
An object's **attributes** are its internal variables that are used to store information about the object. For example, in the code below, we define a cat object that has a name and age.

[2]:
 
class Cat:
    def __init__(self, name, age):
        self.name = name
        self.age = age
​
​
my_cat = Cat("Lily", 3)
​
print(my_cat.name)
print(my_cat.age)
Lily
3
An object's **methods** are its internal functions that implement different capabilities. For example, `str` objects come with an `upper` and `lower` method.

[3]:
 
x = "Zijing"
print(x.lower())
print(x.upper())
zijing
ZIJING
# Classes

The full definition of an object is the object's class. We can define our own classes to create objects that carry out a variety of related tasks or represent information in a convenient way. To better understand the concept of a class, let's implement a class called `Rectangle`.

The first thing we'll need `Rectangle` to do is create a `Rectangle` object. Through a special method called `__init__`, we can define the height and length of the rectangle as attributes. Then we can define methods to calculate the area and perimeter of the rectangle.

## Defining a Class

[4]:
 
class Rectangle(object):
    def __init__(self, height, length):
        self.height = height
        self.length = length
​
    def area(self):
        return self.height * self.length
​
    def perimeter(self):
        return 2 * (self.height + self.length)
[5]:
 
# Instantiate a Rectangle object
rectangle = Rectangle(4, 3)
print(rectangle)
<__main__.Rectangle object at 0x7f694d78aeb0>
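Printing the object shows only its class name and a memory address. Although this lesson doesn't cover it, a common extension is to define the special `__repr__` method so that objects print something readable; a minimal sketch:

```python
class Rectangle(object):
    def __init__(self, height, length):
        self.height = height
        self.length = length

    def __repr__(self):
        # Called when the object is printed or inspected
        return f"Rectangle(height={self.height}, length={self.length})"


rectangle = Rectangle(4, 3)
print(rectangle)  # Rectangle(height=4, length=3)
```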
## Attributes

There are two attributes in this rectangle class: height and length. Both are defined within the `__init__` method. When we instantiate a `Rectangle` object, we set these attributes:

[6]:
 
rectangle.height
[6]:
4
[7]:
 
rectangle.length
[7]:
3
## Methods

Besides the `__init__` method, there are two other methods, `area` and `perimeter`, which tell the rectangle object how to calculate its area and perimeter using its attributes.

[8]:
 
rectangle.area()
[8]:
12
[9]:
 
rectangle.perimeter()
[9]:
14
You might have noticed that both of the methods took as a first argument the keyword `self`. The first argument to any method in a class is the instance of the class upon which the method is being called. Think of a class like a blueprint from which possibly many objects are built. The `self` argument is the mechanism Python uses so that the method can know which instance of the class it is being called upon. Let's say we create a class `MyClass` with a method `.do_it(self)`. If we instantiate an object from this class, we can call the method in two ways:

[10]:
 
class MyClass(object):
    def __init__(self, num):
        self.num = num
​
    def do_it(self):
        print(self.num)
​
​
myclass = MyClass(2)
myclass.do_it()
MyClass.do_it(myclass)
2
2
In the first way, using `myclass.do_it()`, the `self` argument is understood because `myclass` is an instance of `MyClass`. This is the almost universal way to call a method. The other possibility is `MyClass.do_it(myclass)`, where we pass the object `myclass` in as the `self` argument; this syntax is much less common.

As with any Python argument, there is no requirement that the first argument be named `self`; we could also call it `this` or `apple` or `wizard`. However, the use of `self` is a very strong Python convention that is rarely broken. You should follow this convention so that your code is understood by other people.
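To illustrate that the name is only a convention, here is a hypothetical `Counter` class (made up for this example) whose methods work identically with a differently named first argument:

```python
class Counter(object):
    def __init__(wizard, start):
        # "wizard" plays the role normally played by "self"
        wizard.count = start

    def increment(wizard):
        wizard.count += 1
        return wizard.count


c = Counter(10)
print(c.increment())  # 11
```

The code runs, but anyone reading it must mentally translate `wizard` back to `self`, which is exactly why the convention exists.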

<font size="+1">Practice</font>

Define a `Square` class with only one attribute: `side`.

[11]:
 
class Square(object):
    ...
Define an `area` method for the `Square` class:

[ ]:
 
class Square(object):
    ...
Use the class you just defined to calculate the area of a square with side equal to 2.

[ ]:
 
square = ...
​
# Inheritance

Inheritance means we can define a class that inherits everything from another existing class, including its methods and attributes. Inheritance enables us to create new classes that reuse, extend, and modify the behavior defined in other classes. Let's use the `Rectangle` class as an example:

[12]:
 
class Rectangle(object):
    def __init__(self, height, length):
        self.height = height
        self.length = length
​
    def area(self):
        return self.height * self.length
​
    def perimeter(self):
        return 2 * (self.height + self.length)
[13]:
 
rectangle = Rectangle(4, 3)
rectangle.area()
[13]:
12
We can further define another class called `Shape` that inherits everything from `Rectangle`:

[14]:
 
class Shape(Rectangle):
    def is_rectangle(self):
        return True
The `Shape` class behaves like the `Rectangle` class:

[15]:
 
shape = Shape(4, 3)
shape.area()
[15]:
12
Now we can call the `is_rectangle` method, which is not defined in the `Rectangle` class:

[16]:
 
shape.is_rectangle()
[16]:
True
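A subclass can also *modify* inherited behavior by overriding a parent method and reusing the parent's version via `super()`. A small sketch building on the `Rectangle` class above (the `DescribedRectangle` name is made up for illustration):

```python
class Rectangle(object):
    def __init__(self, height, length):
        self.height = height
        self.length = length

    def area(self):
        return self.height * self.length


class DescribedRectangle(Rectangle):
    def area(self):
        # Override: reuse the parent's calculation via super(),
        # then return it with a label
        value = super().area()
        return f"area = {value}"


print(DescribedRectangle(4, 3).area())  # area = 12
```

Calling `area()` on a `DescribedRectangle` runs the subclass version, while plain `Rectangle` objects are unaffected.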
<font size="+1">Practice</font>

Define a `SquareExtra` class that inherits from the `Square` class you just defined (with attribute `side` and method `area`), and add a `perimeter` method to calculate the perimeter of a square given its `side`.

[ ]:
 
class Square(object):
    ...
[ ]:
 
class SquareExtra(Square):
    ...
[ ]:
 
square = ...
​
# Use the area() function
​
[ ]:
 
# Use the perimeter() function
​
# References & Further Reading

  • Wikipedia: Object-oriented Programming
<font size="+3"><strong>APIs</strong></font>

APIs

xxxxxxxxxx
# RESTful APIs

RESTful APIs¶

An Application Programming Interface (**API**) is a way for two or more computer programs to communicate with each other, either when they are running on the same computer or when they are running on different computers across the internet. A **RESTful API** is a type of API that follows the constraints of the REST architectural style and allows for interaction with RESTful web services. REST stands for Representational State Transfer.

A RESTful API is an interface to exchange information securely over the internet. The server first identifies each resource with a unique resource identifier. For REST services, the server uses a Uniform Resource Locator (URL), which is a path to the resource. A URL is like a website address that you enter into your browser to visit a webpage. It usually contains useful information that we can extract.

RESTful APIs are implemented using the Hypertext Transfer Protocol (HTTP). HTTP is a messaging protocol that describes the types and structure of messages that can be used for communication between servers and clients. Communication occurs by clients sending a request to a server and the server replying with a response.

There are four main kinds of requests clients can send: GET, POST, PUT, and DELETE.

  • GET: The most common request is GET. Anytime you open a web browser and visit a website, you are making a GET request. GET asks the server to send the client information. When GET returns an HTTP response code of 200, the request was successful. In an error case, it most often returns a 404 (Not Found) or 400 (Bad Request).

  • POST: POST is used to send data to the server; the request includes a representation of the data. It's also used when you want the server to execute some sort of operation.

  • PUT: PUT is used to update existing resources on the server.

  • DELETE: DELETE is used to remove a resource.
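In the `requests` library, each HTTP method has a matching function (`requests.get`, `requests.post`, `requests.put`, `requests.delete`). To see what such a request looks like without contacting a real server, we can build and prepare one locally; the URL below is a made-up example:

```python
import requests

# Build a POST request carrying JSON data (prepared, but never sent)
prepared = requests.Request(
    "POST",
    "https://example.com/api/items",
    json={"name": "widget"},
).prepare()

print(prepared.method)  # POST
print(prepared.url)
print(prepared.body)
```

Preparing a request shows the exact method, URL, and body that would go over the wire, which is handy for debugging.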

# Making an HTTP Request

We can specify the URL and use the `requests` library to make HTTP requests:

[10]:
 
import requests
​
url = "http://www.google.com"
response = requests.get(url)
response
[10]:
<Response [200]>
The output of the previous code is the status code of the response; recall that 200 means the request was successful.

As mentioned earlier, we can specify parameters in a URL to get specific information. Take the AlphaVantage API as an example. We can specify which stock we want to extract and how frequently we want to extract it, along with our API key.

[11]:
 
# Get IBM's stock prices
ticker = "IBM"
​
​
url = (
    "https://www.alphavantage.co/query?"
    f"function=TIME_SERIES_DAILY&"
    f"symbol={ticker}&"
    f"apikey=demo"
)
​
url
[11]:
'https://www.alphavantage.co/query?function=TIME_SERIES_DAILY&symbol=IBM&apikey=demo'
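Query strings like this can also be assembled with the standard library's `urllib.parse.urlencode`, which handles the `key=value&` joining and escapes any special characters for us:

```python
from urllib.parse import urlencode

ticker = "IBM"
params = {
    "function": "TIME_SERIES_DAILY",
    "symbol": ticker,
    "apikey": "demo",
}

# urlencode joins the parameters as key=value pairs separated by "&"
url = "https://www.alphavantage.co/query?" + urlencode(params)
print(url)
# https://www.alphavantage.co/query?function=TIME_SERIES_DAILY&symbol=IBM&apikey=demo
```

(`requests.get` accepts the same dictionary directly via its `params` argument, which is the more common idiom.)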
Now that we have the URL ready, we can extract the information using `requests`:

[15]:
 
response = requests.get(url=url)
​
# Get the data in json format
data = response.json()
​
# Get the daily stock prices
stock_data = data["Time Series (Daily)"]
We can now see IBM's stock prices for any date:

[14]:
 
stock_data["2022-10-19"]
[14]:
{'1. open': '122.3600',
 '2. high': '123.9400',
 '3. low': '121.9875',
 '4. close': '122.5100',
 '5. volume': '5906576'}
# Building Your Own API

## Declaring Environment Variables

**Environment variables** are variables whose values are defined outside the code, typically through built-in operating system functions or in a configuration file. The most common reason to use environment variables is to avoid modifying and re-releasing an application just because its configuration data changed. Another use case is storing API keys in the configuration file for security reasons: if you are building an application that calls different APIs with API keys, you don't want to show the keys in your code.
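In Python, environment variables are read through `os.environ`. A minimal sketch using a made-up variable name `MY_API_KEY` (you would set it in your shell, e.g. `export MY_API_KEY=...`, before starting the application):

```python
import os

# Read an environment variable, falling back to a default if it is unset.
# MY_API_KEY is a hypothetical name for this example.
api_key = os.environ.get("MY_API_KEY", "missing-key")
print("API key loaded?", api_key != "missing-key")
```

Using `.get()` with a default avoids a `KeyError` when the variable isn't set, so the application can fail with a clearer message of your own.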

## Creating a Path

There are lots of tools for building an API with Python. We will use the `FastAPI` module to show an example of how to create your own API. We first import the library and set up an application:

[8]:
 
from fastapi import FastAPI
​
app = FastAPI()
Next, we create a path for the app to request content from, using the `@app.get` decorator and specifying the path with a forward slash (`/`):

[9]:
 
@app.get("/ping")
def pong():
    return {"ping": "pong!"}
In this example, we define a path called `/ping`. When we make a GET request to this path, the app should return what we defined in the `pong()` function. We can use the `requests` library to check that the app is working.

Before you run the code below, start your FastAPI server by going to the command line, using the `cd` command to get to the `/@textbook` directory, and running the command:

uvicorn main:app --reload --workers 1 --host localhost --port 8009
[16]:
 
import requests
​
url = "http://localhost:8009/ping"
response = requests.get(url=url)
​
print("response code:", response.status_code)
response.json()
response code: 200
[16]:
{'ping': 'pong!'}
Note we can also create a `main` file so that we can include more functions for more complicated applications.

## Data Classes and Type Checking

When setting up an application, it is very important to ensure the input and output of the application are in the desired format. [pydantic](https://pydantic-docs.helpmanual.io/) is a widely used tool for data validation and type-hint enforcement, which provides user-friendly errors and helps catch invalid data types. Defining an object means creating a new class that inherits from the pydantic `BaseModel`. pydantic guarantees that the fields of a model instance will conform to the field types defined on the model. Let's define a `Person` class that has two attributes, name and age. We then use the pydantic `BaseModel` to ensure the name is a string and the age is an integer.

[ ]:
 
from pydantic import BaseModel
​
​
class Person(BaseModel):
    name: str
    age: int
​
​
person = Person(name="Lily", age=3)
print(person)
If we don't pass data in the desired format, pydantic will try to coerce it to the declared type; here the string `"3"` is converted to the integer `3`:

[ ]:
 
person = Person(name="Lily", age="3")
print(person)
pydantic `BaseModel` provides support for most of the common types from the Python standard library. We can define fields as `bool`, `int`, `float`, `str`, `list`, `tuple`, `dict`, `datetime.date`, and so on.

<font size="+1">Practice</font>

Try it yourself! Use the pydantic `BaseModel` to create an `Apartment` class that has three attributes, `name`, `built_year`, and `near_bus_stop`, with appropriate data types:

[ ]:
 
# Build Apartment class
​
​
​
# Instantiate object
apartment = Apartment(name="riverside", built_year=2007, near_bus_stop=True)
print(apartment)
# References & Further Reading

  • RESTful API Introduction
  • RESTful API Documents
  • Environment Variable
  • FastAPI example